Compressing the collective knowledge of ESM into a single protein language model


Nature Methods
Article
https://doi.org/10.1038/s41592-026-03050-9

Tuan Dinh1, Seon-Kyeong Jang2, Noah Zaitlen2,3,4 & Vasilis Ntranos1,5,6,7

Received: 10 May 2025
Accepted: 26 February 2026
Published online: xx xx xxxx

1Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA.
2Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA.
3Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA.
4Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA.
5Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA.
6Diabetes Center, University of California, San Francisco, San Francisco, CA, USA.
7Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA.

Protein language models (PLMs) have recently emerged as a promising approach for next-generation variant-effect prediction (VEP). Most high-performing VEP methods currently combine PLMs with additional information, such as homology, protein structure and population genetics data, to improve prediction accuracy. This performance gain, however, comes with added complexity or limited applicability compared to pure PLMs trained only on raw, unaligned sequences, such as evolutionary scale modeling (ESM). Here we challenge the prevailing view that sequence-only PLMs are intrinsically limited and present an efficient co-distillation approach to adapt them for high-accuracy VEP without requiring additional information beyond evolutionary signals captured during pretraining. We allow individual PLMs to self-improve by distilling the most confident predictions from multiple models of the same family and demonstrate that co-distillation of ESM models suffices to achieve state-of-the-art performance across multiple VEP benchmarks. We further show that this performance increase enables accurate quantification of the severity of variant effects on continuous clinical phenotypes in biobank data.

Predicting the functional consequences of genetic variants, known as variant effect prediction (VEP), is a fundamental computational challenge with important applications in human genetics, drug development and protein engineering1–5. In recent years, the field of VEP has seen substantial advancements, particularly with the emergence of protein language models (PLMs)6–13. These models, inspired by transformative developments in natural language processing, are trained on vast repositories of protein sequences8 to capture the intricate patterns and relationships within the protein sequence space, demonstrating remarkable capabilities in VEP and other protein-related tasks, including protein structure prediction9,14, function annotation15 and protein design16.

To further increase predictive performance, VEP methods often employ a hybrid approach, combining PLMs with additional sources of information such as multiple sequence alignments (MSA), protein structures and population genetics data. Indeed, the top-performing methods on the ProteinGym Deep Mutational Scan (DMS) benchmark17, including SaProt14, TranceptEVE18 and others19,20, are all PLMs that have been trained on three-dimensional (3D) structure or MSA data and use this information during inference to make accurate predictions of variant effects. The recently developed closed-source models PrimateAI-3D13 and AlphaMissense7 also follow a hybrid approach, demonstrating similar performance gains by integrating both modalities with additional fine-tuning on human and primate population genetics data. Although, in theory, protein sequence information alone should be sufficient to achieve the same level of accuracy, at the time of testing sequence-only PLMs substantially underperform these more recent approaches.

Although effective, these hybrid approaches come with increased complexity and may have limited applicability compared to sequence-only PLMs in scenarios where such additional data are unavailable, incomplete or computationally expensive to obtain. Moreover, the integration of diverse data sources can introduce biases and dependencies that may affect both the interpretability and generalizability of the predictions in downstream tasks. Experimental 3D structures, for example, are only available for a small fraction of proteins, algorithmic and hyperparameter choices can have variable effects on MSA quality, and the use of population genetics data can lead to data circularity concerns in clinical genetics applications. Even though having this extra information alongside the sequence can certainly help on average, it is not clear how each modality contributes to each prediction and, more importantly, how reliable the corresponding model can be in its absence. For example, the performance of the sequence-only PLM versions of TranceptEVE and SaProt, which run without MSA and 3D structure, decreases substantially, indicating a strong dependence on the availability of these additional modalities14.

By contrast, pure evolutionary scale PLMs, trained solely on unaligned protein sequences, have the potential to offer a more balanced, streamlined and broadly applicable approach to VEP. In this work, we focus on the widely adopted, sequence-only PLMs of the evolutionary scale modeling (ESM) family8,9,21 and show that their performance is not fundamentally limited compared to the more complex modeling approaches described above.
In particular, we demonstrate that better detection of the evolutionary signals captured during the pretraining of different ESM models can substantially increase the VEP performance of every individual model without using any external information. To achieve this, we introduce a co-distillation framework in which multiple models learn from each other, alternating their roles as teacher and student depending on their estimated confidence for each prediction: for any given mutation, the log-likelihood ratio (LLR) of the model that provides the most confident prediction is used to refine the predictions of the others. Through comprehensive evaluation across multiple VEP benchmarks, we demonstrate that our approach produces substantially improved variant-effect ESM models (VESM) that match or surpass current state-of-the-art VEP methods, effectively closing the performance gap between hybrid and sequence-only PLMs. We further extend this framework to incorporate structural information in a modular fashion, preserving VESM’s advantages while improving performance on structure-dependent tasks. Finally, we demonstrate our models’ capacity to accurately quantif (...truncated)