On the importance of evolutionary constraint for regulatory sequence identification
Briefings in Functional Genomics, 20(6), 2021, 361–369
https://doi.org/10.1093/bfgp/elab015
Advance Access Publication Date: 23 March 2021
Review Paper
Corresponding author: Hugues Roest Crollius. E-mail:
Abstract
Regulation of gene expression relies on the activity of specialized genomic elements, enhancers or silencers, sometimes distributed
over large distances from their target gene promoters. A significant part of vertebrate genomes consists of such
regulatory elements, but their identification and that of their target genes remain challenging, due to the lack of a clear
signature at the nucleotide level. For many years the main hallmark used for identifying functional elements has been their
sequence conservation between genomes of distant species, indicative of purifying selection. More recently, genome-wide
biochemical assays have opened new avenues for detecting regulatory regions, shifting attention away from evolutionary
constraints. Here, we review the respective contributions of comparative genomics and biochemical assays for the definition
of regulatory elements and their targets and advocate that both sequence conservation and preserved synteny, taken as
signatures of functional constraint, remain essential tools in this task.
Key words: comparative genomics; gene regulation; enhancer; vertebrate evolution
Introduction
Understanding how genetic information concealed in DNA
sequence is turned into biological function in living cells,
organisms or ecosystems remains one of the main goals
driving current biological research. In this endeavour, it is of
prime importance to properly identify the functional elements
that orchestrate the expression of genomes into phenotypes.
Several types of genomic elements affect gene expression:
proximal promoters, which define transcription start sites, distal
enhancers and silencers, which bring regulatory complexes to
the promoters, and insulators, which define the broad genomic
domains where regulatory interactions take place. Here we will
use the term ‘regulatory elements’ exclusively to mean distal cis-regulatory elements such as enhancers and silencers. It should
be noted that behind the apparent simplicity of their conceptual
definition, the operational definition of enhancers/silencers, i.e.
the type of experimental evidence required for their validation,
has fluctuated depending on research fields, times or even
people. For example, the original description of enhancers in
cells transfected with episomal vectors demanded that they
function independently of their orientation and distance to the
target [1, 2], requirements that are seldom tested nowadays,
at least in genomic contexts. The difficulties related to the
definition of enhancers are discussed in detail in a recent
review [3].
Indeed, while it is now relatively straightforward to identify
the coding sequences of genes, recognizing regulatory elements
that control their expression remains a much harder task. Two
main reasons account for this contrast. First, coding sequences
reside in exons that need to be transcribed and processed in
characteristic ways (e.g. polyadenylation), which allow for their
specific identification by biochemical isolation (e.g. polyA+ RNA
sequencing). In addition, their nucleotide sequence obeys the
universal genetic code, which induces characteristic constraints
on nucleotide arrangements at small (codon triplets) as well as
large (Open Reading Frame syntax) scales. In contrast, regulatory
elements are more difficult to isolate biochemically and do not
seem to obey a recognizable code at the sequence level.
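To make the contrast concrete, the following is a minimal sketch of how the 'syntax' of coding sequences lends itself to simple rule-based detection. It is an illustration only, not a method taken from the literature reviewed here: the function name `find_orfs` and the `min_codons` parameter are hypothetical, only the forward strand is scanned, and real gene-finding tools rely on far richer models. No comparably simple rule is known for regulatory elements.

```python
# Illustrative sketch only: a naive forward-strand scan for open reading frames (ORFs).
# `find_orfs` and `min_codons` are hypothetical names chosen for this example.

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs of ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    # Toy sequence with one ORF (ATG AAA TTT TAG) in the third reading frame.
    print(find_orfs("CCATGAAATTTTAGCC"))
```

The scan only exploits two regularities named above: the triplet periodicity of codons and the start/stop delimitation of reading frames.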
However, it is generally accepted that the function of regulatory elements is dependent on their nucleotide sequence,
François Giudicelli is an INSERM researcher at the Institut de Biologie de l’École Normale Supérieure (IBENS). His research interests concern the evolution
of gene regulation in vertebrates.
Hugues Roest Crollius is a CNRS researcher and group leader at the Institut de Biologie de l’École Normale Supérieure (IBENS). His lab studies the evolution
of genome organization and function in vertebrates.
© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email:
François Giudicelli and Hugues Roest Crollius
Phylogenetic footprints define regulatory elements
In the pre-genomic era, the quest for phylogenetic footprints
was necessarily restricted to small regions around genes of
interest, yielding only scarce and spatially biased knowledge.
For example, a considerable number of studies were dedicated
to the globin loci, paradigmatic cases for the cis-regulatory
control of gene expression by conserved elements [5–7]. With
longer spans of genomes being sequenced in multiple species,
the use of phylogenetic footprinting allowed the identification
of far-acting enhancers located up to 1 Mb from their target
[8–11]. When the first complete sequences of vertebrate genomes
became available two decades ago, the field of comparative
genomics rapidly expanded its discoveries. Comparing human
and mouse genomes with the first fully sequenced fish genome,
that of the pufferfish Fugu rubripes, led to the identification of several
thousand highly conserved non-coding elements, named CNEs,
which survived 450 million years of divergent evolution [12]. Similar
findings were reported upon comparison of the human and
elephant shark genomes [13], albeit without experimental
confirmation in this case. Here, evolutionary constraints could
be traced back to the last common ancestor of humans and elephant
sharks, at the root of Gnathostomata, which lived 530 million years ago.
The initial focus was indeed placed on conservation over such
extreme evolutionary distances as the most reliable indicator
of regulatory function. This focus on ancient conservation
helped establish the zebrafish as one of the key species used in
vertebrate enhancer studies, mainly because of its status as
a widespread experimental model for developmental biology. Not
only could it be used to infer regulatory regions, but it also
allowed experimental validation of these inferences using a
wide range of transgenesis techniques [14]. It should be noted
however that teleost fishes like zebrafish, fugu or medaka,
which make up one of the largest vertebrate groups with ∼25 000
species, may not be the most appropriate basal group to study
the gene regulatory networks underlying the developmental
programme of vertebrates. According to [15], the accelerated
evolution of their genomes that followed a whole-genome
duplication event at the root of teleosts led to the loss of many
ancient regulatory elements, which may explain why Venkatesh
et al. [13] found more CNEs conserved between human and
elephant shark (4782) than between human and the less distant
fugu (2107) or zebrafish (2938) with the same sequence similarity
thresholds.
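As an illustration of what a 'sequence similarity threshold' means operationally, the sketch below flags conserved blocks in a pairwise alignment using a sliding window of percent identity. It is a toy example under stated assumptions, not the procedure used in [12] or [13]: the function name `conserved_blocks`, the `window` and `min_identity` parameters and their default values are all hypothetical, and the input is assumed to be two already-aligned sequences of equal length with '-' marking gaps.

```python
# Illustrative sketch only: sliding-window identity over a pairwise alignment.
# `conserved_blocks`, `window` and `min_identity` are hypothetical names and
# defaults chosen for this example, not values taken from the cited studies.


def conserved_blocks(aln_a, aln_b, window=10, min_identity=0.9):
    """Return merged (start, end) alignment intervals, end exclusive, covered by
    at least one window whose fraction of identical columns is >= min_identity."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    a, b = aln_a.upper(), aln_b.upper()
    # A column counts as a match only if both residues are identical and not gaps.
    match = [x == y and x != "-" for x, y in zip(a, b)]
    blocks = []
    for start in range(len(match) - window + 1):
        if sum(match[start:start + window]) / window >= min_identity:
            end = start + window
            if blocks and start <= blocks[-1][1]:
                # Overlaps or touches the previous block: extend it.
                blocks[-1] = (blocks[-1][0], end)
            else:
                blocks.append((start, end))
    return blocks


if __name__ == "__main__":
    species_a = "ACGTACGTAC" + "AAAAAAAAAA"
    species_b = "ACGTACGTAC" + "CCCCCCCCCC"
    print(conserved_blocks(species_a, species_b, window=10, min_identity=1.0))
```

Counts of conserved elements obtained this way depend on both the window size and the identity threshold, which is why comparisons such as those between species pairs are only meaningful when the same thresholds are applied throughout.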
The reason w (...truncated)