Protein 3D Structure Computed from Evolutionary Sequence Variation

PLOS ONE, Dec 2011

The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues, including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7–4.8 Å Cα-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org). This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein structures, new strategies in protein and drug design, and the identification of functional genetic variants in normal and disease genomes.
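
For orientation, the kind of model the abstract refers to can be written in the standard pairwise maximum-entropy form. This is only a sketch under stated assumptions: the symbols used here (A_i for the residue at position i, L for the sequence length, e_ij and h_i for pair and single-site parameters, f_i and f_ij for frequencies counted in the alignment, Z for the normalization) are notation assumed for illustration and are not taken from the text above.

```latex
% Assumed notation: A_i = residue at position i, L = sequence length,
% e_{ij} = pair couplings, h_i = single-site terms, Z = normalization constant.
P(A_1,\dots,A_L) \;=\; \frac{1}{Z}\,
  \exp\!\left( \sum_{1 \le i < j \le L} e_{ij}(A_i, A_j) \;+\; \sum_{i=1}^{L} h_i(A_i) \right)

% Constraints: the model marginals P_i and P_{ij} must reproduce the
% single-site and pair frequencies f_i and f_{ij} observed in the alignment.
P_i(A_i) = f_i(A_i), \qquad P_{ij}(A_i, A_j) = f_{ij}(A_i, A_j)
```

Under this form, inferring residue pair couplings amounts to finding parameters e_ij that satisfy the constraints; the strength of e_ij for a pair of positions then serves as the coupling score.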

Citation: Marks DS, et al. (2011) Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE 6(12): e28766. doi:10.1371/journal.pone.0028766

Authors: Debora S. Marks, Lucy J. Colwell, Robert Sheridan, Thomas A. Hopf, Andrea Pagnani, Riccardo Zecchina, Chris Sander

Editor: Andrej Sali, University of California San Francisco, United States of America

Affiliations: 1 Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America; 2 MRC Laboratory of Molecular Biology, Hills Road, Cambridge, United Kingdom; 3 Computational Biology Center, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America; 4 Human Genetics Foundation, Torino, Italy; 5 Politecnico di Torino, Torino, Italy

Funding: CS and RS have support from the Dana Farber Cancer Institute-Memorial Sloan-Kettering Cancer Center Physical Sciences Oncology Center (NIH U54CA143798). LC is supported by an Engineering and Physical Sciences Research Council fellowship (EP/H028064/1). TH has support from the German National Academic Foundation. RZ has support from European Community grant 267915. No other financial support was received for the research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

These authors contributed equally to this work.

Exploiting the evolutionary record in protein families

The evolutionary process constantly samples the space of possible sequences and, by implication, structures consistent with a functional protein in the context of a replicating organism.
Homologous proteins from diverse organisms can be recognized by sequence comparison because strong selective constraints prevent amino acid substitutions in particular positions from being accepted. The beauty of this evolutionary record, reported in protein family databases such as PFAM [1], is the balance between sequence exploration and constraints: conservation of function within a protein family imposes strong boundaries on sequence variation and generally ensures similarity of 3D structure among all family members [2] (Figure 1). In particular, to maintain energetically favorable interactions, residues in spatial proximity may co-evolve across a protein family [2,3]. This suggests that residue correlations could provide information about amino acid residues that are close in structure [4,5,6,7,8,9,10,11]. However, correlated residue pairs within a protein are not necessarily close in 3D space. Confounding residue correlations may reflect constraints that are not due to residue proximity but are nevertheless true biological evolutionary constraints, or they could simply reflect correlations arising from the limitations of our insight and technical noise. Evolutionary constraints on residues involved in oligomerization, protein-protein, or protein-substrate interactions, or other spatially indirect or spatially distributed interactions, can result in co-variation between residues not in close spatial proximity within a protein monomer. In addition, the principal technical causes of confounding residue correlations are transitivity of correlations, statistical noise due to small numbers, and phylogenetic sampling bias in the set of sequences assembled in the protein family [12,13,14,15]. One does not know a priori the relative contributions of these possible causes of co-variation effects and is thus faced with the complicated inverse problem of using observed correlations to infer contacts between residues (Figure 1).
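
The transitivity problem mentioned above can be illustrated with a small Python sketch. It is not the method of the paper: it computes plain mutual information between alignment columns, a simple local correlation measure chosen here only for illustration. The toy alignment and the function name `column_mutual_information` are hypothetical.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `alignment` is a list of equal-length sequences (strings).
    No sequence reweighting or pseudocounts are applied in this sketch.
    """
    n = len(alignment)
    # Single-column and pair counts observed in the alignment.
    count_i = Counter(seq[i] for seq in alignment)
    count_j = Counter(seq[j] for seq in alignment)
    count_ij = Counter((seq[i], seq[j]) for seq in alignment)

    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: column 0 co-varies with column 1 and
# column 1 co-varies with column 2; column 3 varies independently.
toy_alignment = [
    "AKDG", "AKDG", "AKDS", "AKDS",
    "LERG", "LERG", "LERS", "LERS",
]

mi_01 = column_mutual_information(toy_alignment, 0, 1)
mi_02 = column_mutual_information(toy_alignment, 0, 2)
mi_03 = column_mutual_information(toy_alignment, 0, 3)
print(mi_01, mi_02, mi_03)
```

In this toy case columns 0 and 2 score as highly as columns 0 and 1, so the observed correlation alone cannot say whether positions 0 and 2 interact directly or are linked only through position 1. Separating the two cases requires a global model of the whole sequence rather than a pair-by-pair statistic.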
Given alternative causes of true evolutionary co-variation, even if confounding correlations caused by technical reasons can be identified, there is no guarantee that the remaining correlated residue pairs will be dominated by residues in three-dimensional proximity. The initial challenge is thus to solve the inverse sequence-to-structure problem by reducing the influence of confounding factors. Only then is it possible to judge whether the evolutionary process reveals enough residue contacts, which are sufficiently evenly distributed (spread) throughout the protein sequence and structure, to predict the protein fold. The ultimate criterion of performance is the accuracy of 3D structure prediction using the inferred contacts. Previous work combined a small number of evolutionarily inferred residue contacts with other, structural, sources of information to successfully predict the structure of some smaller proteins [16,17,18,19]. However, three crucial open questions remain with respect to using evolutionarily inferred residue-residue couplings for protein fold prediction. The first is whether one can develop a sufficiently robust method to identify causa (...truncated)


This is a preview of a remote PDF: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0028766&type=printable
Article home page: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0028766

Debora S. Marks, Lucy J. Colwell, Robert Sheridan, Thomas A. Hopf, Andrea Pagnani, Riccardo Zecchina, Chris Sander. Protein 3D Structure Computed from Evolutionary Sequence Variation, PLOS ONE, 2011, 12, DOI: 10.1371/journal.pone.0028766