Quantifying uncertainty in protein representations across models and tasks

Nature Methods | Article
https://doi.org/10.1038/s41592-026-03028-7

R. Prabakaran 1 & Yana Bromberg 1,2

Received: 15 May 2025
Accepted: 11 February 2026
Published online: 1 April 2026

Biomolecular embeddings serve as efficient representations of sequence and structure, enabling tasks such as similarity searches, structure and function prediction and estimation of biophysical properties. However, relying on embeddings without assessing their ability to accurately represent biomolecules is a critical flaw, akin to using a scalpel in surgery without verifying its sharpness. Here we propose a means to evaluate the capacity of protein language models to encode biologically meaningful information. For each protein, representation uncertainty is scored as the fraction of non-biological ‘synthetic’ sequences among its nearest neighbors in latent space. Our analysis reveals that low-quality embeddings often fail to capture meaningful biology, displaying vector properties indistinguishable from those of randomly generated sequences. Our model-agnostic scoring framework is, to our knowledge, the first to quantify protein sequence embedding reliability. It enables embedding screening prior to downstream applications and inferences, significantly improving their reliability. We propose that embedding evaluation should be undertaken for other uses of language models in science as well.

Language models (LMs), originally developed for natural language processing (NLP)1, are increasingly accepted as the preferred in silico representation of the primary and higher-order structures of protein, DNA and RNA2–5. Their ability to learn an encoding that captures many aspects of a given biomolecule from simple amino or nucleic acid sequence has made them a promising tool for deriving biological insights6–11.
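The uncertainty score described in the abstract, the fraction of synthetic sequences among a protein's nearest neighbors in latent space, can be illustrated with a minimal sketch. The function name `random_neighbor_score`, the use of cosine similarity as the neighbor metric, the default k = 10 and the toy three-dimensional embeddings are all assumptions made for illustration; they are not taken from the article, and this is not the authors' implementation.

```python
import numpy as np


def random_neighbor_score(query, real, synthetic, k=10):
    """Fraction of synthetic embeddings among the k nearest neighbors of `query`.

    Assumptions (not from the article): neighbors are ranked by cosine
    similarity, and `query` is not itself a row of `real`.
    """
    pool = np.vstack([real, synthetic])
    # Label each pooled embedding: False = real protein, True = synthetic.
    is_synthetic = np.concatenate(
        [np.zeros(len(real), dtype=bool), np.ones(len(synthetic), dtype=bool)]
    )
    # Cosine similarity = dot product of unit-normalized vectors.
    pool_unit = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    similarities = pool_unit @ query_unit
    # Indices of the k most similar pooled embeddings.
    nearest = np.argsort(-similarities)[:k]
    return float(is_synthetic[nearest].mean())


# Toy data (hypothetical): two well-separated clusters in a 3-D "latent space".
rng = np.random.default_rng(0)
real = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(50, 3))
synthetic = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.normal(size=(50, 3))

# A query inside the real cluster has no synthetic neighbors.
score_good = random_neighbor_score(np.array([1.0, 0.0, 0.0]), real, synthetic)
# A query inside the synthetic cluster has only synthetic neighbors.
score_bad = random_neighbor_score(np.array([0.0, 1.0, 0.0]), real, synthetic)
print(score_good, score_bad)  # 0.0 1.0
```

In this toy setting a score near 0 marks an embedding surrounded by real proteins, and a score near 1 marks one that sits among synthetic sequences. For real embedding sets, an indexed nearest-neighbor search would likely be preferable to the brute-force ranking used here.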
LMs encode a biomolecule as an embedding, that is, a sequence of numbers representing a point in a multidimensional latent space. Embeddings serve as powerful computational proxies for facilitating a range of downstream tasks, such as similarity searches, structural and functional annotations and prediction of biomolecule properties12–15. For instance, embeddings from protein language models (pLMs) have been used to predict protein function, mutation effect and subcellular localization, achieving performance that rivals or surpasses traditional methods13,14,16–19. Additionally, fine-tuning pretrained pLMs has been shown to enhance predictions across multiple additional biological tasks, underscoring the versatility of these models2,20,21.

Despite the advantages of embeddings as biomolecular representations, the reliability or confidence of an embedding remains largely unquestioned. Unlike most machine-learning-based predictions that have a corresponding prediction probability/reliability score, a given embedding is not questioned as a representation of a protein any more than a protein sequence would be.

Embeddings are low-dimensional representations of biomolecules in the latent space of the LM, with each vector element serving as a coordinate in the map of this space. Coordinates are learned to encode the training data while minimizing the loss associated with the training tasks22. The model’s uncertainty or confidence in an embedding originates from the same source as any of its predictions: the LM’s training process, which is optimized to reach a computationally feasible solution that balances task performance within cost and time constraints, rather than to achieve complete learning or a globally optimal representation. Put simply, the latent space of a model is just one of many possible optimal mappings for the given training dataset and the training objective.
Moreover, datasets may not comprehensively capture the full sequence space, a limitation that is, arguably, even more obvious for protein sequences than for human languages. As a result, each protein’s projection into the latent space carries an inherent uncertainty,

1Department of Biology, Emory University, Atlanta, GA, USA. 2Department of Computer Science, Emory University, Atlanta, GA, USA.

Nature Methods | Volume 23 | April 2026 | 796–804

[Figure 1, panels a–e. a, mean pLDDT versus TM score. b, cosine similarity against random, by confidence level (low, moderate, high, excellent). c, t-SNE of Astral40 and Astral40R. d, RNSk(Pi) versus k. e, correlation (ρ, τ).]

Fig. 1 | Protein structure prediction quality as a function of embedding certainty. Across a–d, color indicates the prediction confidence levels, judged by TM scores of ESM-2-predicted versus experimentally determined structures of Astral40 domains (n = 14,711): excellent (blue, TM > 0.9, n = 10,251), high (green, 0.7 < TM ≤ 0.9, n = 3,040), moderate (orange, 0.5 < TM ≤ 0.7, n = 768) and low (red, TM ≤ 0.5, n = 652). a, ESM-2-predicted structure quality (mean pLDDT, y axis) is, as expected, correlated with TM scores (x axis) of alignments of predicted versus experimental structures across Astral40 domains. b, Average cosine similarity (y axis) of Astral40’s ESM-2 embeddings to a set of randomly generated, biologically irrelevant sequences in Astral40R (n = 73,555) differs across structure prediction confidence levels (x axis); that is, proteins with poor prediction confidence (red box) tend to have more similarity to random embeddings than high-confidence predictions (green and blue). Blue circles are outliers beyond 1.5× IQR from the first and third quartiles of distribution.
Statistical significance (***P ≤ 1 × 10^-3, **P ≤ 1 × 10^-2 and ‘NS’ otherwise) between low versus moderate (6.55 × 10^-5), moderate versus high (4.29 × 10^-3) and high versus excellent (9.35 × 10^-50) was assessed using a two-sided Mann-Whitney U-test. Purple stars denote the mean cosine similarity of the Astral40R embeddings within each set (box), further highlighting the distinction between low-quality and high-quality embeddings versus random. c, Two-dimensional t-SNE projections of ESM-2 embeddings for Astral40 and Astral40R (gray) illustrate the specifics of the overlap of the two sets. That is, low-scoring protein embeddings (red dots and density lines) fall into the latent space also covered by random sequences (gray). At the same time, the latent space of excellent-scoring embeddings (blue) is nearly disjoint from the random space (gray). d, RNS (y axis), computed across varying values of k (k nearest neighbors, x axis), effectively discriminates ESM-2 embeddings corresponding to low-confidence structures (TM ≤ 0.5, red line) from higher-quality ones (all others). Error bars indicate the 95% confidence interval of mean RNS, derived through 100× bootstrapping. e, RNS is moderately inversely co (...truncated)