Auditing a Scientific Field through its Representations
A global atlas of digital dermatology
Under review · 2025
1University of Basel · 2HSLU · 3Northwestern University
Once we know when to trust representation geometry, the same geometry can audit not just a single dataset but an entire scientific archive. We aggregate 1,135,080 dermatology images from 29 public datasets into one shared representation space and quantify the field's structure from that geometry: how datasets overlap, where new ones genuinely add information, and which clinical phenotypes are missing entirely. The result is the first quantitative atlas of digital dermatology, showing systematic demographic skew, diminishing returns on new releases, and structural voids in the field's clinical coverage.
The dermatology archive at field scale
Dermatology AI's clinical reliability depends on the quality, diversity, and scale of the data it is trained on. For years the community has presumed the archive is biased: that pediatric patients are underrepresented, that the Global North dominates, that dark skin tones are missing. These have been intuitions and partial single-dataset audits, not a quantified field-level picture. SkinMap measures the bias the community has long presumed.
Without a field-level picture, the community operates without key performance indicators for data acquisition. New datasets get collected and released without a way to tell whether they expand the clinical map or replicate what is already known. Models trained on the resulting corpus inherit any blind spots silently, and external validations on nominally distinct datasets can report inflated generalisation when the datasets share representation structure.
Representation geometry gives us the lens. A learned representation organises samples by similarity, so distances between dataset embeddings, neighbourhood structure, and topological properties become directly meaningful for questions about coverage and redundancy. We build a single embedding space across 29 public datasets and use that space to measure what the field actually contains.
Building the atlas
The atlas combines two complementary representations, both trained from scratch on the union of the 29 datasets. Self-supervised image encoders learn a space that respects pure visual similarity, with no labels and no off-the-shelf foundation-model weights involved. In parallel, an image-text contrastive model is trained on templated captions assembled from the available metadata, which anchors clinical terminology into the visual manifold. Both streams are then projected into a shared low-dimensional space via a learned projector and combined into an ensemble. Because everything is trained from the data itself rather than fine-tuned from a general-purpose foundation model, the resulting space is shaped by the dermatology archive rather than by the inductive biases of an upstream pretraining set, so the audit measures the data and not an outside prior.
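A minimal sketch of the fusion step described above. The random matrices stand in for the learned projectors, and all dimensions and variable names are illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two streams over the same 1,000 images:
# a self-supervised visual embedding and an image-text embedding.
ssl_emb = rng.normal(size=(1000, 384))
clip_emb = rng.normal(size=(1000, 512))

def project(x, w):
    """Apply a linear projector into the shared space, then L2-normalise."""
    z = x @ w
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# In the paper the projectors are learned; random matrices here only
# illustrate how two spaces of different width meet in one shared space.
w_ssl = rng.normal(size=(384, 64))
w_clip = rng.normal(size=(512, 64))

# A simple ensemble: average the two projected, normalised streams.
shared = 0.5 * (project(ssl_emb, w_ssl) + project(clip_emb, w_clip))
print(shared.shape)  # (1000, 64)
```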
Harmonising metadata across the 29 datasets, let alone adding new attributes, had not been attempted at this scale. Each archive uses its own schema, and many lack key fields entirely. One of the long-standing complaints about public dermatology data is precisely that the field could not even quantify its own demographic composition. On top of the shared manifold, a set of linear probes is trained on the partially annotated subset to predict missing demographic attributes (Fitzpatrick skin type, age, sex, geographic origin). The probes turn the sparsely annotated archive into a fully attributed one: imputed Fitzpatrick coverage expands by +97.1 pp, geography by +54.3 pp. We validate the imputation engine in two ways. On strictly held-out datasets (DDI, PAD-UFES-20), the SkinMap ensemble outperforms state-of-the-art foundation models (MONET, PanDerm) on attribute prediction. In a panel study, five practising dermatologists annotated the same 150 cases independently, and they agreed with the model's predictions more often than they agreed with each other. With this clinically validated engine in place, the field-level audit is no longer constrained to the labelled fraction.
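At its core, the imputation step is a linear probe on frozen embeddings. A hedged sketch on synthetic data, with scikit-learn's `LogisticRegression` standing in for whatever probe head the paper actually trains:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic shared-space embeddings: 600 images in a 64-d latent space,
# linearly separable by construction. The labels play the role of
# Fitzpatrick types; 400 images are annotated, 200 are not.
centers = rng.normal(size=(6, 64))
labels = rng.integers(0, 6, size=600)
emb = centers[labels] + 0.1 * rng.normal(size=(600, 64))

# Fit the probe on the annotated fraction, impute the rest.
probe = LogisticRegression(max_iter=1000).fit(emb[:400], labels[:400])
imputed = probe.predict(emb[400:])

print((imputed == labels[400:]).mean())  # near 1.0 on this easy toy data
```

On real embeddings the probe's ceiling is of course set by how well the attribute is linearly decodable from the manifold, which is exactly what the held-out and panel validations measure.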
What the audit shows
With one embedding space across the entire archive, three geometric questions become measurable: who is represented (demographic coverage), how much each new dataset actually adds (novelty), and what is structurally missing (topological voids).
Demographic biases
Pooled across all 29 datasets and using the imputation engine for missing labels, the archive remains markedly skewed. Fitzpatrick V–VI account for 11.0% of images, pediatric patients (≤ 18 years) 2.3%, and geographic concentration sits in the Global North.
Diminishing returns on new datasets
Dataset size has grown exponentially over the past decade. Novelty has not. We define yearly novelty as the share of each new dataset that lies in a region of the shared latent space not already covered by the existing archive. Across the 29-dataset chronology, novelty plateaus around the early 2020s and remains flat, even as image counts continue to rise.
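One way to operationalise that novelty definition is a nearest-neighbour coverage test: a new image counts as novel if it falls farther than some radius from every point in the existing archive. A toy sketch, where the radius, dimensions, and data are all illustrative assumptions:

```python
import numpy as np

def novelty_share(new_emb, archive_emb, radius):
    """Fraction of new embeddings whose nearest archive neighbour is
    farther than `radius`, i.e. that land in uncovered latent territory.
    (A hypothetical stand-in for the paper's coverage definition.)"""
    d = np.linalg.norm(new_emb[:, None, :] - archive_emb[None, :, :], axis=-1)
    return float((d.min(axis=1) > radius).mean())

rng = np.random.default_rng(0)
archive = rng.normal(size=(500, 8))        # the existing corpus
redundant = rng.normal(size=(100, 8))      # same distribution: replication
novel = rng.normal(size=(100, 8)) + 6.0    # shifted: genuinely new territory

print(novelty_share(redundant, archive, radius=2.0))  # low
print(novelty_share(novel, archive, radius=2.0))      # 1.0
```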
To quantify pairwise overlap directly, we compute Fréchet distances between the latent distributions of every dataset pair. Several nominally distinct datasets sit at small Fréchet distance from each other in our embedding space, which means that external validations conducted across these pairs are not genuinely out-of-distribution.
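The Fréchet distance between Gaussian fits to two point clouds has a closed form, the same one used by FID. A sketch on synthetic embedding clouds:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x, y):
    """Fréchet distance between Gaussian fits to two embedding clouds:
    ||mu_x - mu_y||^2 + Tr(C_x + C_y - 2 (C_x C_y)^{1/2})."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):      # numerical noise can leave a tiny
        covmean = covmean.real        # imaginary part; drop it
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(size=(2000, 16))
b = rng.normal(size=(2000, 16))        # same distribution as a
c = rng.normal(size=(2000, 16)) + 3.0  # shifted: genuinely distinct

print(frechet_distance(a, b))  # near zero: not an independent test set
print(frechet_distance(a, c))  # large: genuinely out-of-distribution
```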
Pairwise Fréchet distance between the 29 datasets in the shared embedding space. Smaller distances (lighter cells) mean two datasets sit near each other in latent space and therefore do not constitute genuinely independent test sets for each other.
Several nominally distinct datasets share small Fréchet distance. External validation between such pairs reports similarity rather than out-of-distribution generalisation.
Structural voids
Some gaps are not just about who or what is missing in absolute counts. They are topological: regions of the embedding space where no clinical phenotype lives, surrounded by populated regions. We apply spectral persistent homology to detect these voids directly in the latent geometry, without relying on any metadata.
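To convey the idea of an empty region enclosed by populated ones, here is a deliberately crude 2-d proxy: grid occupancy plus a flood fill from the boundary. This is not the spectral persistent homology the paper uses; it is only a sketch of the kind of structure that method detects:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

# Toy 2-d latent cloud: points on a ring, so the centre is empty but
# surrounded by populated territory -- the kind of structure persistent
# homology flags as a hole.
theta = rng.uniform(0.0, 2.0 * np.pi, size=4000)
radius = rng.normal(2.0, 0.15, size=4000)
pts = np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1)

# Grid occupancy over a box containing the cloud.
bins = 24
H, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=bins,
                         range=[[-3.0, 3.0], [-3.0, 3.0]])
occ = H > 0

# Flood-fill the empty cells reachable from the grid boundary; any
# empty cell left over is enclosed by occupied cells, i.e. a void.
outside = np.zeros_like(occ)
q = deque()
for i in range(bins):
    for j in range(bins):
        if (i in (0, bins - 1) or j in (0, bins - 1)) and not occ[i, j]:
            outside[i, j] = True
            q.append((i, j))
while q:
    i, j = q.popleft()
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < bins and 0 <= nj < bins \
                and not occ[ni, nj] and not outside[ni, nj]:
            outside[ni, nj] = True
            q.append((ni, nj))

void_cells = ~occ & ~outside
print(void_cells.sum() > 0)  # True: the ring's interior is a void
```

Persistent homology does the same job without a grid and in higher dimensions, tracking holes across scales so that noise-sized gaps are filtered out and only persistent voids survive.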
The most prominent void in the dermatology atlas is nail pathology. Of about 7,800 nail-related images, the bulk concentrates on unspecified nail disease, psoriasis, and onychomycosis. Subungual melanoma, yellow nail syndrome, onychogryphosis, trachyonychia, glomus tumors, koilonychia, habit-tic deformity, myxoid cysts, and Beau's lines are absent or severely underrepresented (under 200 images each). We do not observe a visual phenotype for these conditions in the global archive, so the scarcity is not explained by missing labels.
Using the atlas
Beyond the diagnostic audit, the same shared embedding space drives three immediate applications.
Clinician-facing semantic search. Uploaded cases are projected onto the manifold in real time, and the system returns clinically similar cases drawn from the entire 1.1 million-image archive. Retrieval is no longer constrained to a single dataset.
Structural overlap audit. Pairwise Fréchet distances quantify which datasets are genuinely independent. External validations using nominally distinct but representationally similar datasets can be flagged before they inflate reported generalisation.
Strategic data acquisition. The novelty curve and persistent-homology voids identify where the next dataset should focus. Healthy skin in Fitzpatrick V–VI, pediatric cases, and rare nail pathologies are concrete acquisition targets where the atlas predicts the largest marginal coverage gain.
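The retrieval step behind the semantic search can be sketched as cosine-similarity nearest neighbours over the shared embeddings. A minimal stand-in, not the production system, with all sizes and names illustrative:

```python
import numpy as np

def top_k(query_emb, archive_emb, k=5):
    """Indices of the k most similar archive embeddings by cosine
    similarity -- a hypothetical stand-in for the atlas's retrieval."""
    q = query_emb / np.linalg.norm(query_emb)
    a = archive_emb / np.linalg.norm(archive_emb, axis=1, keepdims=True)
    return np.argsort(-(a @ q))[:k]

rng = np.random.default_rng(0)
archive = rng.normal(size=(10_000, 64))  # stand-in for the 1.1M archive
query = archive[42] + 0.01 * rng.normal(size=64)  # near-duplicate query

print(top_k(query, archive)[0])  # 42: the closest case ranks first
```

A real deployment would swap the brute-force matrix product for an approximate nearest-neighbour index, but the geometry being queried is the same.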
The complete SkinMap framework, including the model ensemble, the imputation engine, and the digital atlas, will be released open source alongside a live web app for clinical and research use.
Resources
- SkinMap. "A Global Atlas of Digital Dermatology to Map Innovation and Disparities". Under review. arXiv 2601.00840
- Source code and live digital atlas: open-source release pending acceptance.
BibTeX
@article{groger2025skinmap,
title = {A Global Atlas of Digital Dermatology to Map Innovation and Disparities},
author = {Gr{\"o}ger, Fabian and Lionetti, Simone and Gottfrois, Philippe and
Gonzalez-Jimenez, Alvaro and Habermacher, Lea and Amruthalingam, Ludovic and
Groh, Matthew and Pouly, Marc and Navarini, Alexander A. and
{Labelling Consortium}},
journal = {Under review},
year = {2025},
note = {arXiv preprint arXiv:2601.00840}
}