The Rapid Evolution of De Novo Proteins in Structure and Complex
GBE
Jianhai Chen 1,*, Qingrong Li 2,3, Shengqian Xia 1, Deanna Arsala 1, Dylan Sosa 1, Dong Wang 2,3,*, and Manyuan Long 1,*
1 Department of Ecology and Evolution, The University of Chicago, Chicago, IL 60637, USA
2 Division of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA 92093, USA
3 Department of Cellular & Molecular Medicine, School of Medicine, University of California San Diego, La Jolla, CA 92093, USA
*Corresponding authors: E-mails: ; ; .
Accepted: May 10, 2024
Abstract
Recent genome-wide studies in rice have established that de novo genes, evolving from noncoding sequences, enhance
protein diversity through a stepwise process. However, the pattern and rate of their evolution in protein structure over time
remain unclear. Here, we addressed these issues within a surprisingly short evolutionary timescale (<1 million years for 97%
of Oryza de novo genes) by comparing de novo genes with gene duplicates. We found that de novo genes evolve faster than
gene duplicates in the intrinsically disordered regions (such as random coils), secondary structure elements (such as α helix
and β strand), hydrophobicity, and molecular recognition features. In de novo proteins, specifically, we observed an 8%
to 14% decay in random coils and intrinsically disordered region lengths and a 2.3% to 6.5% increase in structured elements,
hydrophobicity, and molecular recognition features, per million years on average. These patterns of structural evolution align
with changes in amino acid composition over time as well. We also revealed higher positive charges but smaller molecular
weights for de novo proteins than duplicates. Tertiary structure predictions showed that most de novo proteins, though
not typically well folded on their own, readily form low-energy and compact complexes with other proteins, facilitated by extensive
residue contacts and conformational flexibility, suggesting a faster-binding scenario in de novo proteins to promote
interaction. These analyses illuminate a rapid evolution of protein structure in de novo genes in rice genomes, originating
from noncoding sequences, highlighting their quick transformation into active, protein complex-forming components within
a remarkably short evolutionary timeframe.
Key words: de novo genes, gene duplicates, structural evolution, protein complex, new genes.
Significance
The structural evolution of de novo proteins remains a fundamentally important question for understanding the evolution
of molecular functions of de novo genes. We detected a rapid evolution of protein structure in de novo genes of
Oryza on a surprisingly short timescale.
Introduction
The complexity and adaptability of biological functions often
find their roots in ever-evolving genetic systems.
Central to this is the emergence of de novo genes
(Long et al. 2003; Alba and Castresana 2005; Levine
et al. 2006; McLysaght and Hurst 2016)—genes that
arise from regions of DNA once categorized as
“junk” and considered functionally insignificant
© The Author(s) 2024. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse,
distribution, and reproduction in any medium, provided the original work is properly cited.
Genome Biol. Evol. 16(6) https://doi.org/10.1093/gbe/evae107 Advance Access publication 16 May 2024
evolved domains (Bitard-Feildel et al. 2015; Basile et al. 2017;
Wilson et al. 2017; Heames et al. 2020; Lange et al. 2021;
Heames et al. 2023). Conversely, other studies present inconsistent
results, owing to differences in average disorder among
species (Ekman and Elofsson 2010; Schmitz et al. 2018;
Vakirlis et al. 2018). The question of whether ISD is influenced
by gene age or if it can evolve over time remains unresolved.
Additionally, the evolvability of well-folded structural elements
in de novo genes, such as 3₁₀ helices, α helices, and
β strands, remains an open question. Are the amino acid
compositions of de novo proteins optimized for structural stability
over time? AlphaFold2 currently stands as the leading
deep learning tool for predicting protein structures utilizing
coevolutionary information from multiple sequence alignments
(Jumper et al. 2021). Molecular dynamics (MD) simulation
studies have revealed that most de novo proteins are
flexible in structure and a minority of them adopt well-known
protein structures (Middendorf and Eicholt 2024; Peng and
Zhao 2024). Despite the tendency of de novo proteins to
be disordered with few (or no) orthologs, AlphaFold2’s predictions
reveal that they generally achieve higher confidence
scores per residue (predicted local distance difference test
[pLDDT]) than random sequences (Middendorf et al. 2024).
AlphaFold2 performs the MD refinement (called “relax”
in AlphaFold2 terminology) using OpenMM (Jumper et al.
2021). In addition, a benchmarking study based on 2,613
proteins with experimentally determined structures indicates
that AlphaFold2 is a good predictor of the structure of loop
regions (regions of neither α helices nor β strands), especially
for short loop regions (Stevens and He 2022). The pLDDT
score is an excellent metric for assessing modeling confidence,
disorder levels, and structural variability (Saldaño
et al. 2022; Wilson et al. 2022), with AlphaFold2 demonstrating
a significant correlation between pLDDT scores and the
presence of secondary structures in disorder-rich proteins,
both globally and locally (Wilson et al. 2022). Recent studies
showed that model quality can be estimated by generating
many structure models for the same protein and quantifying
the structural similarities among the models by TM (template
modeling) score (Mukherjee and Zhang 2009; Peng and Zhao
2024). These findings suggest AlphaFold2’s pivotal role in
elucidating the biological implications of de novo proteins,
which are predominantly characterized by variable structural
changes.
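As a minimal illustration of how pLDDT can be summarized per protein, the Python sketch below reads per-residue scores from a PDB-format model and reports the mean pLDDT and the fraction of low-confidence residues. It assumes that pLDDT is stored in the B-factor column of each ATOM record, as in AlphaFold2 output files. The helper names (make_ca_line, ca_plddt, summarize), the six-residue sample, and the cutoff of 50 for "low confidence" are illustrative assumptions, not the procedure used in this study.

```python
from statistics import mean


def make_ca_line(serial, resname, resseq, plddt):
    """Build one fixed-width PDB ATOM record for a CA atom (dummy coordinates)."""
    return "ATOM  %5d  CA  %3s A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, resname, resseq, 0.0, 0.0, 0.0, 1.0, plddt)


# Made-up six-residue peptide; the pLDDT values are illustrative, not real predictions.
SAMPLE_PDB = "\n".join(
    make_ca_line(i + 1, res, i + 1, score)
    for i, (res, score) in enumerate([
        ("MET", 32.5), ("SER", 41.0), ("GLY", 48.5),
        ("LEU", 71.0), ("ALA", 88.0), ("VAL", 91.0),
    ])
)


def ca_plddt(pdb_text):
    """Return one pLDDT value per residue, taken from the CA atom's B-factor field."""
    scores = []
    for line in pdb_text.splitlines():
        # Atom name occupies columns 13-16; B-factor occupies columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def summarize(scores, low_cutoff=50.0):
    """Mean pLDDT and the fraction of residues below an assumed low-confidence cutoff."""
    if not scores:
        raise ValueError("no CA atoms found")
    n_low = sum(1 for s in scores if s < low_cutoff)
    return {
        "n_residues": len(scores),
        "mean_plddt": mean(scores),
        "frac_low": n_low / len(scores),
    }


if __name__ == "__main__":
    print(summarize(ca_plddt(SAMPLE_PDB)))
```

The fraction of low-pLDDT residues serves here only as a rough proxy for disorder; a dedicated disorder predictor or a different cutoff may be more appropriate for a given dataset.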
Another emerging question is whether and how de novo proteins,
which are often very short, interact with other, usually
larger, proteins and form complexes with other
biomolecules. Indeed, roughly 40% of all protein–protein interactions
are between proteins and shorter peptides, many
of which play critical roles in cellular life-cycle functions
(Lee et al. 2019). Recent advances like AlphaFold-Multimer
excel in predicting peptide–protein interactions (Johansson-Åkhe and Wallner 2022), which could facilitate our understanding
of the evolution of de novo proteins and potential
( (...truncated)