Revisiting the Platonic Representation Hypothesis
An Aristotelian View

ICML 2026

1EPFL  ·  2University of Basel  ·  3HSLU

*Equal contribution

The Platonic Representation Hypothesis (Huh et al., 2024) argues that as neural networks scale, their representations converge to a shared statistical model of the world. We identify two pervasive confounders in the metrics behind that claim, model width and model depth, and propose a permutation-based null calibration that removes them. After calibration, the cross-modal trend reported by global similarity metrics largely flattens, while the trend reported by local neighborhood metrics remains. We therefore refine the original hypothesis to what the data shows: across modalities, what we observe converging is shared local neighborhood relationships (apple ≈ pear).

Overview: width and depth confounders and the global vs local convergence after calibration
The whole story in one figure. (a, b) Width and depth inflate raw similarity scores even when there is no real signal. Calibration removes the inflation. (c) On real models, calibration separates inflation from real signal. The trend reported by global metrics like CKA was mostly inflation and largely flattens after calibration. The trend reported by local metrics like mutual $k$-NN was already real signal and remains.

The Platonic story

Quantifying the similarity between neural representations is central to understanding learned representation spaces, guiding transfer learning, and connecting artificial representations to neural measurements. The Platonic Representation Hypothesis posits that as neural networks scale, their representations across modalities become increasingly similar, suggesting convergence to a shared statistical model of reality. This claim has motivated follow-up work on scaling, representation alignment, and brain-model comparisons. The trend is consistent across standard metrics (CKA, CCA, RSA, mutual $k$-NN) and across model families. The empirical pattern itself is well established. What we re-examine is whether the standard similarity metrics behind it actually support the convergence interpretation drawn from them.

Two confounders distort representational similarity

We identify two pervasive confounders that arise when similarity is measured between neural representations: model width and model depth. Both inflate raw scores in a way that grows with model scale, and together they undermine the comparative use of representational similarity without calibration.

The width confounder

Many spectral similarity metrics (linear and kernel CKA, the RV coefficient, CCA/SVCCA/PWCCA, RSA) can be written as functionals of an interaction operator built from two representations, such as the (normalized) cross-covariance $\widetilde{\mathbf{C}} = (n-1)^{-1}\mathbf{X}_{c}^{\top}\mathbf{Y}_{c}$. A natural intuition is that for independent $\mathbf{X}, \mathbf{Y}$ this operator is close to zero and the metric should be too. In high dimensions, this fails. For i.i.d., mean-zero, identity-covariance rows with $\mathbf{X}$ and $\mathbf{Y}$ independent,

$$ \mathbb{E}_{H_0}\!\left[\,\bigl\|\widetilde{\mathbf{C}}\bigr\|_F^2\,\right] \;=\; \frac{d_x d_y}{n-1}, $$

and the corresponding CKA null baseline is $\mathcal{O}(d/n)$. Wider models can appear more aligned simply because their representations live in higher-dimensional spaces. The same inflation appears for any interaction-matrix-based similarity, even though the exact form of the null can differ.
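
This is easy to check numerically. Below is a minimal numpy sketch (a textbook linear-CKA implementation, not code from any released package): for independent Gaussian representations at fixed $n$, the empirical null mean climbs with $d$ along the $d/(d+n-1)$ curve.

import numpy as np

def linear_cka(X, Y):
    # Linear CKA between representations X (n, d_x) and Y (n, d_y).
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    num = np.linalg.norm(Xc.T @ Yc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return num / den

rng = np.random.default_rng(0)
n = 256
for d in (32, 128, 512, 2048):
    # X and Y are independent: any non-zero score is pure null inflation.
    null = [linear_cka(rng.standard_normal((n, d)), rng.standard_normal((n, d)))
            for _ in range(15)]
    print(f"d={d:5d}  empirical null ~ {np.mean(null):.3f}  d/(d+n-1) = {d/(d+n-1):.3f}")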

Not every metric reacts to width this way. Neighborhood-based similarities (mutual $k$-NN, cycle-$k$-NN, CKNNA) compare a fixed-size set of nearest neighbors instead of an interaction operator in the feature space. Their null depends on the neighborhood size $k$ rather than on the representation dimension $d$.

Width confounder: raw vs calibrated similarity across (d, n) grids
Independent Gaussian representations. Raw similarity scores grow with $d$ and shrink with $n$. Calibrated scores collapse to zero. The same pattern holds for any interaction-based similarity.

Pick a construction and drag the width slider. For small $d$ the null score is close to zero and the calibrated score sits on top of it. As $d$ grows, the raw score increases while the calibrated score remains near zero. The same pattern appears under three different constructions: independent Gaussian inputs, shared inputs through random linear projections, and shared inputs through random ReLU MLPs.


The depth confounder

When comparing two networks $A$ and $B$ of depths $L_A$ and $L_B$, it is generally unknown at which pair of layers the optimal alignment arises. A common practice is therefore to compute the full $L_A \times L_B$ similarity matrix and report a selection-based summary, most often the maximum across layer pairs. Selection introduces a separate inflation: under $H_0$, the maximum of $M = L_A L_B$ noisy measurements is inflated above the mean. A uniform sub-Gaussian right-tail bound, which holds for any bounded similarity metric via Hoeffding's inequality, gives

$$ \mathbb{E}_{H_0}\!\left[T_{\max}\right] \;\le\; \mu + C\,\sigma\sqrt{\log M}. $$

The reported "alignment" therefore grows monotonically with the search space size, so deeper models can appear more aligned simply because more layer pairs are compared, even when nothing is shared between the representations.
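
The look-elsewhere effect is just as easy to reproduce in isolation. The sketch below uses i.i.d. Gaussian draws as a stand-in for the per-pair null scores (any sub-Gaussian noise behaves the same way up to constants): the mean of the maximum grows with $M$ and stays under the $\mu + \sigma\sqrt{2\log M}$ envelope, even though every individual score has mean $\mu$.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, trials = 0.0, 1.0, 2000

for M in (1, 4, 16, 64, 256, 1024):
    # Max over M null measurements, averaged over many trials.
    t_max = rng.normal(mu, sigma, size=(trials, M)).max(axis=1).mean()
    bound = mu + sigma * np.sqrt(2 * np.log(M))
    print(f"M={M:5d}  E[T_max] ~ {t_max:+.3f}  mu + sigma*sqrt(2 log M) = {bound:.3f}")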

Depth confounder: raw max over many comparisons inflates with the number of comparisons. Aggregation-aware calibration leaves it flat.
The depth confounder. As the number of compared layer pairs grows, the raw aggregate inflates under the null (left). The corresponding aggregation-aware calibrated score stays flat across the same aggregators (max, row-max-mean, top-$k$-mean) (right).

Both inflations compound

Wider models inflate the per-pair null. Deeper models additionally inflate any selection-based aggregate over those pairs. The two effects stack, and both come from how the metric is computed at a given model scale, not from any agreement between the representations themselves.

How big can nothing look? Drag the width $d$ and the number of compared layer pairs $L$ for two truly independent representations ($n = 128$ samples). The big number on the left is what linear CKA reports under $H_0$. The cyan portion of the bar is the per-pair null mean $d/(d+n-1)$. The orange portion is the extra look-elsewhere lift $C\sigma\sqrt{\log L}$ from taking the max across $L$ layer pairs. Calibration returns the number on the right, every time.

Example readout: raw CKA under $H_0$ = 0.825, decomposed into a width contribution $d/(d+n-1) = 0.669$ and a depth lift $C\sigma\sqrt{\log L} = +0.156$; after calibration: 0.000.

The calibration framework

We operationalize the null hypothesis $H_0$ (no relationship between the two representations) by permuting sample correspondences. For $K$ permutations $\pi_k \in \Pi_n$ drawn i.i.d. uniformly, we form null scores $s^{(k)} = s(\mathbf{X}, \pi_k(\mathbf{Y}))$. Together with the observed score $s_{\mathrm{obs}} = s(\mathbf{X}, \mathbf{Y})$, they define a right-tail critical value $\tau_\alpha$ as the $(1-\alpha)$-quantile of the combined set $\{s_{\mathrm{obs}}, s^{(1)}, \ldots, s^{(K)}\}$. For bounded similarity metrics with known maximum $s_{\max}$ (typically $1$), the max-preserving calibrated score is

$$ s_{\mathrm{cal}} \;=\; \max\!\left(\frac{s_{\mathrm{obs}} - \tau_\alpha}{s_{\max} - \tau_\alpha},\; 0\right). $$

$\tau_\alpha$ is the principled zero point. The outer $\max(\cdot, 0)$ enforces $s_{\mathrm{cal}} = 0$ whenever $s_{\mathrm{obs}} \le \tau_\alpha$, i.e. when the observation falls inside the null. The denominator $s_{\max} - \tau_\alpha$ rescales the excess so $s_{\mathrm{cal}}$ lives in $[0, 1]$, with $0$ meaning indistinguishable from chance and $1$ meaning the maximum possible similarity.
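
In code, the whole procedure is a few lines. A minimal numpy sketch of the scalar calibration (the released calibrated-similarity package is the reference implementation; `metric` here stands for any bounded similarity function):

import numpy as np

def calibrate_score(metric, X, Y, n_permutations=1000, alpha=0.05, s_max=1.0, seed=0):
    rng = np.random.default_rng(seed)
    s_obs = metric(X, Y)
    # Null scores: break the sample correspondence by permuting rows of Y.
    nulls = [metric(X, Y[rng.permutation(len(Y))]) for _ in range(n_permutations)]
    # Right-tail critical value: the (1 - alpha)-quantile of {s_obs} plus the nulls.
    tau = np.quantile(np.append(nulls, s_obs), 1 - alpha)
    if tau >= s_max:   # null saturates the metric: uninformative regime
        return 0.0
    # Max-preserving rescaling onto [0, 1]; zero whenever s_obs <= tau.
    return max((s_obs - tau) / (s_max - tau), 0.0)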

A concrete null. Pick a scenario, hit play, and watch $1000$ permutation scores accumulate. The shaded band on the right is the region above the critical value $\tau_\alpha$ at $\alpha = 0.05$. With signal, the observed score sits inside the band and the calibrated score is non-zero. Without signal, the observed score sits inside the null and the calibrated score is zero.


The same framework removes both confounders

For the depth confounder, the reported number is a summary of a similarity matrix $\mathbf{S} \in \mathbb{R}^{L_A \times L_B}$, e.g. the maximum across layer pairs $T(\mathbf{S}) = \max_{\ell,\ell'} S_{\ell,\ell'}$. Here the null is the distribution of the same aggregate. For each draw $\pi_k$ we apply the same permutation across all layers, build $\mathbf{S}^{(k)}$, and aggregate to $T^{(k)} = T(\mathbf{S}^{(k)})$. The aggregate critical value is

$$ \tau_\alpha^{\mathrm{agg}} \;=\; \mathrm{quantile}_{1-\alpha}\!\bigl\{\,T_{\mathrm{obs}},\, T^{(1)},\, \ldots,\, T^{(K)}\bigr\}, $$

and the calibrated aggregate keeps the same max-preserving form,

$$ T_{\mathrm{cal}} \;=\; \max\!\left(\frac{T_{\mathrm{obs}} - \tau_\alpha^{\mathrm{agg}}}{s_{\max} - \tau_\alpha^{\mathrm{agg}}},\; 0\right). $$

One framework, two uses. Scalar calibration removes the width confounder. Aggregation-aware calibration additionally removes the depth confounder.
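
A sketch of the aggregate variant, under the same assumptions as the scalar sketch above. The two differences are that each draw applies one permutation shared across every layer pair and that the quantile is taken over the aggregated statistic:

import numpy as np

def calibrate_aggregate(metric, layers_a, layers_b, aggregate=np.max,
                        n_permutations=1000, alpha=0.05, s_max=1.0, seed=0):
    # layers_a, layers_b: lists of (n, d_l) representations, one per layer.
    rng = np.random.default_rng(seed)
    n = len(layers_b[0])

    def agg(perm=None):
        S = np.array([[metric(Xa, Yb if perm is None else Yb[perm])
                       for Yb in layers_b] for Xa in layers_a])
        return aggregate(S)

    T_obs = agg()
    # One shared permutation per draw, reused across all layer pairs.
    T_null = [agg(rng.permutation(n)) for _ in range(n_permutations)]
    tau = np.quantile(np.append(T_null, T_obs), 1 - alpha)
    return max((T_obs - tau) / (s_max - tau), 0.0) if tau < s_max else 0.0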

Statistical guarantees

The permutation $p$-value is super-uniform under $H_0$: $\mathbb{P}_{H_0}(p \le \alpha) \le \alpha$ for all $\alpha \in [0, 1]$, under a standard exchangeability condition. By the same argument, $\mathbb{P}_{H_0}(s_{\mathrm{obs}} > \tau_\alpha) \le \alpha$, so the gating rule "$s_{\mathrm{cal}} > 0$" is a finite-sample $\alpha$-level declaration of similarity above chance. The same holds for the aggregate case with $T_{\mathrm{obs}}$ and $\tau_\alpha^{\mathrm{agg}}$. Empirically the bound is tight: Type I error tracks the nominal level across a range of $d/n$ ratios, and detection power tracks the strength of any real signal.

Type I error rate and detection power for the calibrated score
Calibrated test under simulation. Left: the realised Type I rate stays at the nominal $\alpha = 0.05$ across $d/n$. Right: detection power grows with the signal-to-noise of a planted shared subspace, across CKA (linear and RBF), mutual $k$-NN, and RSA.

Two metric families, two regimes

The width analysis already split the standard similarity metrics into two families with very different null behaviour, and calibration is what makes the split visible. Spectral, interaction-based metrics (linear and RBF CKA, the RV coefficient, CCA/SVCCA/PWCCA, and shape-based measures like RSA) are all functionals of an interaction operator in the high-dimensional feature space, typically the cross-covariance or a Gram-matrix object. Under independent rows with mean zero and identity covariance, the cross-covariance has expected squared Frobenius norm $d_x d_y/(n-1)$, and the resulting CKA null inherits an $\mathcal{O}(d/n)$ scale. Calibration subtracts this baseline. For wide models that means subtracting a piece that grows with $d$, which leaves little of the cross-modal scaling trend behind. The same arithmetic applies to the RV coefficient and to CCA-style projections. The null moves with $d$, and calibration removes most of it.

Neighborhood-based metrics work on a different object. Mutual $k$-NN compares which points are each other's nearest neighbors at a fixed $k$ and asks for set overlap, not for an inner product in the feature space. Its null does not see $d$ directly: $\mathbb{E}_{H_0}[\mathrm{mKNN}_k] = k/(n-1)$. The same regime covers cycle-$k$-NN and the kernel-aware CKNNA, and we test all three in the paper. None of them inflate noticeably with $d$, so after calibration their reported trend on real models is essentially what was already there in the raw score.
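
The $k/(n-1)$ null is just as easy to verify numerically. A sketch with a plain Euclidean mutual $k$-NN (again a textbook implementation, not the paper's code): the empirical null sits at $k/(n-1)$ while $d$ grows by two orders of magnitude.

import numpy as np

def mutual_knn(X, Y, k=10):
    def knn_sets(Z):
        sq = (Z ** 2).sum(axis=1)
        D = sq[:, None] + sq[None, :] - 2 * Z @ Z.T   # squared Euclidean distances
        np.fill_diagonal(D, np.inf)                   # exclude self-neighbors
        return np.argsort(D, axis=1)[:, :k]
    overlaps = [len(set(a) & set(b)) for a, b in zip(knn_sets(X), knn_sets(Y))]
    return np.mean(overlaps) / k

rng = np.random.default_rng(0)
n, k = 256, 10
for d in (32, 512, 4096):
    null = [mutual_knn(rng.standard_normal((n, d)), rng.standard_normal((n, d)), k)
            for _ in range(15)]
    print(f"d={d:5d}  empirical null ~ {np.mean(null):.3f}  k/(n-1) = {k/(n-1):.3f}")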

Both closed forms, checked empirically. Toggle the metric and slide the sample size $n$. The teal markers are the empirical mean across $15$ trials of independent Gaussian representations. The orange dashed line is the closed-form null. Spectral metrics climb with $d$. Neighborhood metrics do not.

CKA, spectral. Empirical mean ±2σ across 15 trials of independent Gaussian $\mathbf{X}, \mathbf{Y}$ (teal) and the closed-form $d/(d+n-1)$ (orange dashed) at $n = 256$. Doubling $n$ halves the null. Increasing $d$ inflates it.

The distinction is between families, not between one privileged metric and everything else. Anything whose statistic depends on representations only through fixed-size local neighborhood comparisons (mutual $k$-NN, cycle-$k$-NN, CKNNA, and other neighborhood-overlap measures) sits in the neighborhood regime. Anything that touches the interaction operator (spectral, second-moment, shape, CCA-style projection) sits in the spectral regime. Two families, run on the same models, can therefore report opposite scaling trends, and calibration sharpens that disagreement instead of softening it.

The conclusion deserves care. After calibration we find little empirical evidence of cross-modal global convergence on the val modelset of Huh et al. (2024) and on the video–language panel, and significant evidence of cross-modal local-neighborhood convergence on the same data. This is a statement about what the data shows once the metrics' own scale-dependent bias is removed. It is not a claim that global convergence cannot exist in larger or differently constructed settings. On the val modelset and the video–language panel, the global trend reported in the literature was largely an artefact of width and depth inflation. The local trend was already a real signal.

Revisiting the Platonic Representation Hypothesis

We follow the experimental protocol of Huh et al. (2024): $n = 1024$ image–text pairs from WIT, embeddings from three language model families (Bloomz, OpenLLaMA, LLaMA) and five vision model families (ImageNet-21K, MAE, DINOv2, CLIP, CLIP-finetuned), $204$ vision-language model pairs, layer-wise similarity reported as the maximum over layers. The only change is the calibration step on top.

Global similarity (RBF CKA) and local similarity (mutual $k$-NN) across model families: raw vs calibrated.

CKA (global). Raw scores (dotted) rise with model scale, reproducing the trend cited as evidence of cross-modal convergence. Calibrated scores (solid) flatten.

For global metrics, calibration removes the scaling trend that the original work reported. The uncalibrated CKA grows with model capacity. The calibrated CKA does not. For local metrics, the picture is different: raw and calibrated curves lie close together and both rise with capacity. The raw-score convergence persists after calibration.

The same data as the curves above, shown at the level of individual model pairs. Pick a metric and switch between raw and calibrated. Hover any cell for the exact pair and score.

For CKA, the raw matrix is hot across the board. Calibration removes the scaling trend, but the residual cells are still positive: there is a small genuine global signal that the width-driven inflation was sitting on top of. For mutual $k$-NN the change is small. The local agreement was already in the raw score.

The same conclusion holds for other modalities

The framework is metric-agnostic and modality-agnostic. As a further test, we replace the image embeddings with video embeddings from a frozen video encoder and run the same protocol against the same panel of language models.

Video-language alignment: RBF CKA and mutual k-NN, raw vs calibrated, across LLM families
Video–language alignment on the same panel of LLMs. Left: RBF CKA, a global metric, drops sharply after calibration. Right: for mutual $k$-NN, a local metric, raw and calibrated curves stay close across LLM capacity.

The same pattern appears in the video–language setting. Global RBF CKA loses most of its reported alignment after calibration, while mutual $k$-NN keeps its scaling trend.

The Aristotelian Representation Hypothesis

After calibration, the global similarity trend reported by spectral metrics like CKA largely flattens, while the local trend reported by neighborhood metrics like mutual $k$-NN does not. The original Platonic Representation Hypothesis is therefore not so much wrong as imprecise: it bundled two trends, only one of which the data still shows after calibration. We refine it to a smaller but sharper claim about what we observe.

Neural networks, trained with different objectives on different data and modalities, converge to shared local neighborhood relationships.
Local alignment: two representation spaces share the same clustering of points even when their coordinate systems are arranged differently
Local alignment. The two spaces in the cartoon agree on which points cluster together while sitting in different coordinate layouts.

Why we call it Aristotelian

The name follows the philosophical structure of the refinement. Plato's account of universals places them in a transcendent realm: the particulars we observe are imperfect shadows of one true form. The Platonic Representation Hypothesis fits that picture, with networks approximating a single underlying geometry of the world.

Aristotle, Plato's student, takes the universal out of the transcendent realm and places it in the empirical pattern that holds across particulars. The form of a horse is what each horse has in common, observed in horses, not somewhere else. Categories instead of forms.

Mapped onto representations, the Platonic claim is that capable networks are imperfect projections of one true coordinate system. The refined claim is more modest. What we observe as shared across networks is the categorical structure: which points cluster, which neighborhoods recur. On the coordinates themselves, the calibrated metrics do not show a comparable agreement on the same data.

Each point is one concept embedded both in an image encoder and in a text encoder. The panels use different coordinate layouts. What is largely shared between them is the nearest-neighbor structure. Hover any point to see its $k$ nearest neighbors in both spaces. Shared neighbors are emphasized with a thicker ring, and labels appear (emoji on the image side, word on the text side). Drag $k$ to vary the neighborhood size, drag density to add more points.

Interactive demo: Image space and Text space panels, with a running average mutual-agreement readout.

The reframing changes what one can put to the test. A claim about shared global geometry is difficult to falsify: any specific metric that fails to find it can be dismissed as the wrong measurement. A claim about shared local neighborhoods is operational. For a given neighborhood size $k$, sample size $n$, and representation dimension $d$, one can ask whether the observed agreement exceeds a permutation null at level $\alpha$. On the val modelset, that agreement passes the test across two orders of magnitude in $k$.

We are not making a metaphysical claim about how networks recognise things. The argument is empirical: under the metrics we have and once their scale-dependent bias is removed, what we observe agreeing across networks is local structure. The corresponding global trend, after calibration, is not what the data shows on the same models. Local structure is what we recommend measuring when one talks about "shared representations".

Using calibration in your own work

If you measure representational similarity in your own work (transfer learning, distillation, interpretability, model comparison), the calibration framework is meant to drop into existing pipelines. The Python package is on PyPI:

pip install calibrated-similarity

The recipe is one function call:

from calibrated_similarity import calibrate

# cka: any similarity function with signature metric(X, Y) -> float;
# feats_a, feats_b: representations of shape (n_samples, dim).
result = calibrate(metric=cka, X=feats_a, Y=feats_b, n_permutations=1000, alpha=0.05)
# result.score        -> calibrated similarity in [0, 1]
# result.p_value      -> permutation p-value
# result.tau          -> the (1 - alpha) quantile of the null

The notes below cover the choices you have to make in practice.

Choosing the number of permutations

$K$ permutations give you a null of size $K + 1$ (the observed score counts in the calibrated quantile). The smallest detectable $p$-value is $1/(K+1)$, so $K = 999$ gives $p \ge 0.001$. For headline numbers, $K$ between $1000$ and $10000$ works. For screening many comparisons, $K = 200$ is usually enough to get a stable $\tau_\alpha$.

Stability of tau_alpha as a function of the number of permutations K
$\tau_\alpha$ stabilises quickly with $K$. A few hundred permutations are enough for a stable threshold; the gain from more is small after about $K = 1000$.
Picking the significance level ($\alpha$) and handling many comparisons

$\alpha$ controls the false-positive rate on a single comparison. With $M$ comparisons (an $L \times V$ grid of language and vision models), naive $\alpha = 0.05$ thresholding gives about $0.05 \cdot M$ false positives in expectation. For the val modelset with $M = 204$ pairs that is roughly 10 spurious cells. Either tighten $\alpha$ or, more cleanly, apply a Benjamini-Hochberg FDR correction on the permutation $p$-values. The package returns these directly.

from calibrated_similarity import calibrate, fdr_threshold

pvalues = [...]  # permutation p-values from calibrate(...), shape (L, V)

mask = fdr_threshold(pvalues, q=0.05)  # cells that pass FDR at q = 0.05
When the null threshold hits the ceiling ($\tau_\alpha \ge 1$) or the calibrated score is zero ($s_{\mathrm{cal}} = 0$)

$\tau_\alpha \ge 1$ means the null already reaches the metric's maximum. That usually means your $d/n$ ratio is so large that the metric is uninformative. Increase $n$, reduce $d$ (e.g. PCA the representations to a comparable dimension), or switch to a metric whose null does not blow up. $s_{\mathrm{cal}} = 0$ with a non-trivial raw score means your observed similarity does not exceed the random-permutation baseline at level $\alpha$. That is a real negative result, not a bug.
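
For the PCA route, a minimal numpy sketch (assuming the `feats_a` and `feats_b` arrays from the recipe above; the target dimension of $128$ is an illustrative choice, picked so that $d/n$ stays small):

import numpy as np

def pca_reduce(Z, d_out):
    # Project each representation onto its top d_out principal components.
    Zc = Z - Z.mean(axis=0)
    U, S, _ = np.linalg.svd(Zc, full_matrices=False)
    return U[:, :d_out] * S[:d_out]

# Reduce both representations to a comparable dimension before calibrating.
X_red, Y_red = pca_reduce(feats_a, 128), pca_reduce(feats_b, 128)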

Aggregation-aware calibration

If your reported number is a summary over many cells (max alignment, top-$k$ mean, row-wise max), the null must match the aggregation. The package supports this directly:

aggregated = calibrate(
    metric=cka,
    X=feats_a, Y=feats_b,
    n_permutations=1000,
    aggregator="topk_mean",   # one of "max", "rowmax_mean", "topk_mean", ...
    aggregator_k=5,
)

The raw max of an $L \times V$ matrix and the matched calibrated max are different statistics. The calibrated version accounts for the look-elsewhere effect. The raw version does not.

What calibration does not fix

Calibration corrects for representational scale (width, depth, dimension-related null inflation). It does not fix the choice of probe dataset, distribution shift between the data you measure on and the data the models were trained on, or selection bias in which models you compare. Those need separate care.

Resources

BibTeX

@inproceedings{groger2026revisiting,
  title     = {Revisiting the Platonic Representation Hypothesis: An Aristotelian View},
  author    = {Gr{\"o}ger, Fabian and Wen, Shuo and Brbi{\'c}, Maria},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}