Quantifying uncertainty in protein representations across models and tasks
nature methods
Article
https://doi.org/10.1038/s41592-026-03028-7
R. Prabakaran1 & Yana Bromberg1,2
Received: 15 May 2025
Accepted: 11 February 2026
Published online: 1 April 2026
Biomolecular embeddings serve as efficient representations of sequence
and structure, enabling tasks such as similarity searches, structure and
function prediction and estimation of biophysical properties. However,
relying on embeddings without assessing their ability to accurately
represent biomolecules is a critical flaw—akin to using a scalpel in surgery
without verifying its sharpness. Here we propose a means to evaluate the
capacity of protein language models to encode biologically meaningful
information. For each protein, representation uncertainty is scored as the
fraction of non-biological ‘synthetic’ sequences among its nearest neighbors
in latent space. Our analysis reveals that low-quality embeddings often fail to
capture meaningful biology, displaying vector properties indistinguishable
from those of randomly generated sequences. Our model-agnostic scoring
framework is, to our knowledge, the first to quantify protein sequence
embedding reliability. It enables embedding screening prior to downstream
applications and inferences, significantly improving their reliability. We
propose that embedding evaluation should be undertaken for other uses of
language models in science as well.
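The score described above, the fraction of synthetic sequences among a protein's nearest neighbors, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `random_neighbor_score`, the use of cosine similarity, the pooling of real and synthetic embeddings into one neighbor set, and the exclusion of each protein from its own neighbor list are all assumptions made for the example.

```python
import numpy as np


def random_neighbor_score(real, synthetic, k=10):
    """Fraction of synthetic embeddings among each real embedding's k nearest neighbors.

    Assumptions (not taken from the paper): neighbors are ranked by cosine
    similarity, the neighbor pool is the union of real and synthetic
    embeddings, each real embedding is excluded from its own neighbor list,
    and no embedding is the zero vector.
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    n_real = len(real)

    # Pool real and synthetic embeddings; synthetic rows start at index n_real.
    pool = np.vstack([real, synthetic])
    if not 1 <= k <= len(pool) - 1:
        raise ValueError("k must be between 1 and the pool size minus one")

    # Unit-normalize rows so that a dot product equals cosine similarity.
    unit = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    sims = unit[:n_real] @ unit.T  # shape: (n_real, n_pool)

    # A protein should not count as its own neighbor.
    sims[np.arange(n_real), np.arange(n_real)] = -np.inf

    # Indices of the k most similar pool members for each real embedding.
    nearest = np.argpartition(-sims, k - 1, axis=1)[:, :k]

    # Score = share of those neighbors that come from the synthetic set.
    return (nearest >= n_real).mean(axis=1)
```

Under these assumptions, a score near 0 means a protein's neighborhood consists almost entirely of other natural sequences, whereas a score near 1 means its embedding sits among synthetic ones. The dense similarity matrix is adequate for a toy example; for datasets of the size discussed here, an approximate nearest-neighbor index would likely be needed.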
Language models (LMs), originally developed for natural language
processing (NLP)1, are increasingly accepted as the preferred in silico
representation of the primary and higher-order structures of proteins, DNA and RNA2–5. Their ability to learn an encoding that captures
many aspects of a given biomolecule from simple amino or nucleic
acid sequence has made them a promising tool for deriving biological insights6–11. LMs encode a biomolecule as an embedding—that is,
a sequence of numbers representing a point in a multidimensional
latent space. Embeddings serve as powerful computational proxies for
facilitating a range of downstream tasks, such as similarity searches,
structural and functional annotations and prediction of biomolecule
properties12–15. For instance, embeddings from protein language models
(pLMs) have been used to predict protein function, mutation effect and
subcellular localization, achieving performance that rivals or surpasses
traditional methods13,14,16–19. Additionally, fine-tuning pretrained pLMs
has been shown to enhance predictions across multiple additional
biological tasks, underscoring the versatility of these models2,20,21.
Despite the advantages of embeddings as biomolecular representations, the reliability or confidence of an embedding remains largely
unquestioned. Most machine-learning-based predictions come with a
corresponding probability or reliability score, whereas a given
embedding is questioned as a representation of a protein no more
than the protein sequence itself would be.
Embeddings are low-dimensional representations of biomolecules
in the latent space of the LM, with each vector element serving as a
coordinate in the map of this space. Coordinates are learned to encode
the training data while minimizing the loss associated with the training
tasks22. The model’s uncertainty or confidence in an embedding originates from the same sources as any of its predictions—the LM’s training
process, optimized to reach a computationally feasible solution that
balances task performance within cost and time constraints, rather
than achieving complete learning or a globally optimal representation. Put simply, the latent space of a model is just one of many possible optimal mappings for the given training dataset and the training
objective. Moreover, datasets may not comprehensively capture the
full sequence space—a limitation that is, arguably, even more obvious
for protein sequences than for human languages. As a result, each protein’s projection into the latent space carries an inherent uncertainty,
1Department of Biology, Emory University, Atlanta, GA, USA. 2Department of Computer Science, Emory University, Atlanta, GA, USA.
Nature Methods | Volume 23 | April 2026 | 796–804
[Fig. 1 image, panels a–e. Recoverable labels: mean pLDDT (0–100) versus TM score (0–1.0); cosine similarity against random (0.85–1.00) for the low, moderate, high and excellent confidence levels, with significance marks ***, ** and ***; t-SNE of Astral40 and Astral40R; RNSk(Pi) versus k; correlation (ρ, τ) versus k.]
Fig. 1 | Protein structure prediction quality as a function of embedding
certainty. Across a–d, color indicates the prediction confidence levels, judged
by TM scores of ESM-2-predicted versus experimentally determined structures
of Astral40 domains (n = 14,711): excellent (blue, TM > 0.9, n = 10,251), high
(green, 0.7 < TM ≤ 0.9, n = 3,040), moderate (orange, 0.5 < TM ≤ 0.7, n = 768)
and low (red, TM ≤ 0.5, n = 652). a, ESM-2-predicted structure quality (mean
pLDDT, y axis) is, as expected, correlated with TM scores (x axis) of alignments of
predicted versus experimental structures across Astral40 domains.
b, Average cosine similarity (y axis) of Astral40’s ESM-2 embeddings to a set of
randomly generated, biologically irrelevant sequences in Astral40R (n = 73,555)
differs across structure prediction confidence levels (x axis)—that is, proteins
with poor prediction confidence (red box) tend to have more similarity to
random embeddings than high-confidence predictions (green and blue).
Blue circles are outliers beyond 1.5× IQR from the first and third quartiles
of the distribution. Statistical significance (***P ≤ 1 × 10⁻³, **P ≤ 1 × 10⁻² and ‘NS’
otherwise) was assessed using a two-sided Mann–Whitney U-test for low versus
moderate (6.55 × 10⁻⁵), moderate versus high (4.29 × 10⁻³) and high versus
excellent (9.35 × 10⁻⁵⁰). Purple stars denote the mean cosine similarity
of the Astral40R embeddings within each set (box), further highlighting the
distinction between low-quality and high-quality embeddings versus random.
c, Two-dimensional t-SNE projections of ESM-2 embeddings for Astral40 and
Astral40R (gray) illustrate the specifics of the overlap of the two sets. That is, low-scoring protein embeddings (red dots and density lines) fall into the latent space
also covered by random sequences (gray). At the same time, the latent space of
excellent-scoring embeddings (blue) is nearly disjoint from the random space
(gray). d, RNS (y axis), computed across varying values of k (k nearest neighbors,
x axis), effectively discriminates ESM-2 embeddings corresponding to low-confidence structures (TM ≤ 0.5, red line) from higher-quality ones (all others).
Error bars indicate the 95% confidence interval of mean RNS, derived through
100× bootstrapping. e, RNS is moderately inversely co (...truncated)