Compressing the collective knowledge of ESM into a single protein language model
nature methods
Article
https://doi.org/10.1038/s41592-026-03050-9
Received: 10 May 2025
Tuan Dinh1, Seon-Kyeong Jang2, Noah Zaitlen2,3,4 & Vasilis Ntranos1,5,6,7
Accepted: 26 February 2026
Published online: xx xx xxxx
Protein language models (PLMs) have recently emerged as a promising
approach for next-generation variant-effect prediction (VEP). Most
high-performing VEP methods currently utilize PLMs combined with
additional information, such as homology, protein structure and
population genetics data, to improve prediction accuracy. This performance
gain, however, comes with added complexity or limited applicability
compared to pure PLMs trained only on raw, unaligned sequences, such as
evolutionary scale modeling (ESM). Here we challenge the prevailing view
that sequence-only PLMs are intrinsically limited and present an efficient
co-distillation approach to adapt them for high-accuracy VEP without
requiring additional information beyond evolutionary signals captured
during pretraining. We allow individual PLMs to self-improve by distilling
the most confident predictions from multiple models of the same family
and demonstrate that co-distillation of ESM models suffices to achieve
state-of-the-art performance across multiple VEP benchmarks. We further
show that this performance increase enables accurate quantification of the
severity of variant effects on continuous clinical phenotypes in biobank data.
Predicting the functional consequences of genetic variants, known as
variant effect prediction (VEP), is a fundamental computational challenge with important applications in human genetics, drug development and protein engineering1–5. In recent years, the field of VEP has
seen substantial advancements, particularly with the emergence of
protein language models (PLMs)6–13. These models, inspired by transformative developments in natural language processing, are trained
on vast repositories of protein sequences8 to capture the intricate
patterns and relationships within the protein sequence space, demonstrating remarkable capabilities in VEP and other protein-related
tasks, including protein structure prediction9,14, function annotation15
and protein design16.
To further increase predictive performance, VEP methods often
employ a hybrid approach, combining PLMs with additional sources
of information such as multiple sequence alignments (MSA), protein
structures and population genetics data. Indeed, the top-performing
methods on the ProteinGym Deep Mutational Scan (DMS) benchmark17,
including SaProt14, TranceptEVE18 and others19,20, are all PLMs that
have been trained on three-dimensional (3D) structure or MSA data
and use this information during inference to make accurate predictions of variant effects. The recently developed closed-source models
PrimateAI-3D13 and AlphaMissense7 also follow a hybrid approach,
demonstrating similar performance gains by integrating both modalities with additional fine-tuning on human and primate population
genetics data. Although, in theory, protein sequence information
alone should be sufficient to achieve the same level of accuracy, at the
time of testing, sequence-only PLMs substantially underperform
compared to these more recent approaches.
1Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA. 2Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA. 3Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA. 4Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA. 5Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA. 6Diabetes Center, University of California, San Francisco, San Francisco, CA, USA. 7Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA.
Although effective, these hybrid approaches come with increased complexity and may have limited applicability compared to sequence-only PLMs in scenarios where such additional data are unavailable, incomplete or computationally expensive to obtain. Moreover, the integration of diverse data sources can introduce biases and
dependencies that may affect both the interpretability and generalizability of the predictions in downstream tasks. Experimental 3D structures, for example, are only available for a small fraction of proteins,
algorithmic and hyperparameter choices can have variable effects on
MSA quality, and the use of population genetics data can lead to data
circularity concerns in clinical genetics applications. Even though having this extra information alongside the sequence can certainly help on
average, it is not clear how each modality contributes to each prediction
and, more importantly, how reliable the corresponding model can be
in its absence. For example, the performance of TranceptEVE and SaProt
decreases substantially when they are run as sequence-only PLMs, without
MSA and 3D structure, indicating a strong dependence on
the availability of these additional modalities14. By contrast, pure evolutionary scale PLMs, trained solely on unaligned protein sequences,
have the potential to offer a more balanced, streamlined and broadly
applicable approach to VEP.
In this work, we focus on the widely adopted, sequence-only PLMs
of the evolutionary scale modeling (ESM) family8,9,21 and show that their
performance is not fundamentally limited compared to the
more complex modeling approaches described above. In particular, we demonstrate
that better detection of the evolutionary signals captured during the
pretraining of different ESM models has the capacity to substantially
increase the VEP performance of all individual models without using
any external information. To achieve this, we introduce a co-distillation
framework within which multiple models can learn from each other,
alternating their role as teachers and students depending on their
estimated confidence for each prediction; the log-likelihood ratio
(LLR) of the model that provides the most confident prediction for
any given mutation is used to refine the predictions of the others.
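The teacher-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the text here does not say how confidence is estimated, so the sketch assumes, purely for illustration, that the model with the largest absolute LLR for a mutation is the most confident one. It also assumes the common convention that the LLR is the log-probability of the mutant residue minus that of the wild-type residue. The function names `select_teacher_targets` and `distillation_loss`, the squared-error objective and the toy LLR values are all hypothetical.

```python
from typing import Tuple

import numpy as np


def select_teacher_targets(llr: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Pick, per mutation, the LLR of the most confident model.

    llr has shape (n_models, n_mutations); entry [m, j] is model m's
    log-likelihood ratio for mutation j (assumed convention:
    log p(mutant) - log p(wild type)).
    """
    # ASSUMPTION: confidence is proxied by |LLR|. The source text does not
    # specify the confidence estimator; this is a stand-in.
    confidence = np.abs(llr)
    teacher_idx = confidence.argmax(axis=0)  # index of the teacher per mutation
    targets = llr[teacher_idx, np.arange(llr.shape[1])]  # the teacher's LLR
    return teacher_idx, targets


def distillation_loss(llr: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Per-model mean squared error to the teacher LLRs (hypothetical objective)."""
    return ((llr - targets[None, :]) ** 2).mean(axis=1)


# Toy LLRs for three models and three mutations (made-up numbers).
llr = np.array([
    [-1.0, -6.0, 0.5],
    [-4.0, -2.0, 0.1],
    [-0.5, -3.0, -2.5],
])
teacher_idx, targets = select_teacher_targets(llr)
losses = distillation_loss(llr, targets)
print(teacher_idx, targets, losses)
```

In this toy example each model acts as the teacher for exactly one mutation, which mirrors the alternating teacher and student roles described in the text. In a real training loop the targets would be held fixed while each student model is updated towards them.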
Through comprehensive evaluation across multiple VEP benchmarks,
we demonstrate that our approach produces substantially improved
variant-effect ESM models (VESM) that match or surpass current
state-of-the-art VEP methods, effectively closing the performance
gap between hybrid and sequence-only PLMs. We further extend this
framework to incorporate structural information in a modular fashion, preserving VESM’s advantages while improving performance
on structure-dependent tasks. Finally, we demonstrate our models’
capacity to accurately quantif (...truncated)