Populational landscape of INDELs affecting transcription factor-binding sites in humans
Ribeiro-dos-Santos et al. BMC Genomics (2015) 16:536
DOI 10.1186/s12864-015-1744-5
RESEARCH ARTICLE
Open Access
André M. Ribeiro-dos-Santos1, Vandeclécio L. da Silva1,2, Jorge E.S. de Souza2,3 and Sandro J. de Souza4*
Abstract
Background: Differences in gene expression have a significant role in the diversity of phenotypes in humans.
Here we integrated human public data from ENCODE, 1000 Genomes and Geuvadis to explore the populational
landscape of INDELs affecting transcription factor-binding sites (TFBS). A significant fraction of TFBS close to the
transcription start site of known genes is affected by INDELs, with a consequent effect on the expression of the
associated gene.
Results: Hundreds of TFBS-affecting INDELs (TFBS-ID) show a differential frequency between human populations,
suggesting a role of natural selection in the spread of such variant INDELs. A comparison with a dataset of known
human genomic regions under natural selection allowed us to identify several cases of TFBS-ID likely involved in
populational adaptations. Ontology analyses of the differential TFBS-ID further indicated several biological processes
under natural selection in different populations.
Conclusion: Together, our results strongly suggest that INDELs have an important role in modulating gene expression
patterns in humans. The dataset we make available, together with other data reporting variability at both regulatory
and coding regions of genes, represents a powerful tool for studies aiming to better understand the evolution of gene
regulatory networks in humans.
Keywords: Transcription factor, Transcription factor-binding site, INDEL, Population genetics
* Correspondence: 4 Brain Institute, UFRN, Av. Nascimento de Castro, 2155 - 59056-450, Natal, RN, Brazil
Full list of author information is available at the end of the article
Background
Much has been debated about the evolutionary role of
genetic alterations in the regulation of gene expression
[1–7]. In that respect, transcription factor-binding sites
(TFBS) have recently been studied both in humans and
other animals [8–10]. Several genome-wide analyses
have identified regions close to genes (usually enriched
with TFBS) showing patterns of diversity in accordance
with a model of positive selection [1, 10]. In a recent
study, Arbiza et al. [1] found that TFBS are under
weaker selection than protein-coding regions of genes,
although these authors observed several instances
of adaptation in TFBS. Similarly, Vernot et al.
[10] found hundreds of adaptive variations.
Although these studies have shed some light on the
evolutionary forces acting on TFBS and other regulatory
elements, several issues remain poorly explored or even
unexplored. One of them is the role of INDELs (insertions/
deletions) as a source of genetic variability among TFBS.
Most of the few populational studies in this area are biased
towards single nucleotide variants (SNV) [3, 9, 11]. Based
on that, we decided to explore this issue by using three
types of data recently made public. First, whole-genome
sequences of more than a thousand human individuals
from the 1000 Genomes Project (TGP) [12] were used to
identify polymorphic INDELs. Second, a genome-wide
identification of TFBS for 148 transcription factors from
the ENCODE (Encyclopedia of DNA Elements) Project
[13] was used to generate a catalogue of TFBS in the
human genome. Finally, expression data from a subset of
individuals from the 1000 Genomes Project [14] was used
to evaluate the impact of TFBS-affecting INDELs (TFBS-ID) on the expression of the corresponding gene. Integration of all these data allowed us to show a high frequency
of TFBS-ID in the human genome. Hundreds of TFBS-ID
showed a differential frequency in human populations and
ontology analyses of these cases suggested a role of natural
selection and population history in their distribution.
Based on that, we argue that a TFBS-ID has been selected
in Africans by down-regulating APIP (APAF1-interacting
protein) and generating a better response to Salmonella
infection. A comparative analysis with genomic regions
known to be under positive selection [15] revealed that a
significant fraction of the TFBS-ID we identified represent instances of adaptation in human populations.
© 2015 Ribeiro-dos-Santos et al. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain
Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article,
unless otherwise stated.
Results and Discussion
Identification of TFBS-ID
Fig. 1 shows a schematic representation of the computational pipeline used in all analyses reported here. To build
a catalogue of TFBS-ID, we first indexed all TFBS identified by the ENCODE project in the human reference
genome (hg19 version). Data from the 1000 Genomes project regarding the position of INDELs in the reference
genome was then compared to the position of TFBS and
those cases in which an INDEL overlapped with a TFBS
were selected. This strategy yielded a total of 259,864
TFBS affected by at least one INDEL. Since a significant
fraction of TFBS overlap at the sequence level, the non-redundant number of TFBS-ID in the above set was
100,182 (an average of 2.59 TFBS per INDEL). Due to the
presence of long INDELs affecting many TFBS at once, we
decided to limit our analysis to those INDELs shorter than
200 bp, which gave us a total of 99,642 TFBS-ID and
258,686 TFBS. Although the upper limit was set to
200 bp, the final set of 99,642 TFBS-ID is strongly biased
towards shorter INDELs. More than 99.8 % of all INDELs were
equal to or shorter than 20 bp. Next, TFBS-ID close (≤5 kb)
to the transcription start site (TSS) of known human
genes (as defined by the Reference Sequence set) were selected. In total, 7,313 human genes had at least one TFBS
affected by a polymorphic INDEL in the 1000 Genomes
dataset. This set of 7,313 genes had a total of 38,339 TFBS
affected by INDELs and 10,528 TFBS-ID. A complete list
of this dataset is available in Additional file 1: Table S1.
Fig. 1 Analysis overview. Schematic representation of the strategy used here to identify and analyse TFBS affected by polymorphic INDELs in
human populations
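The overlap, length-filter and TSS-proximity steps described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the record types (`Interval`, `Indel`), the function names, the 0-based half-open coordinate convention and the toy coordinates are all assumptions made for illustration. Only the two thresholds (INDELs shorter than 200 bp, TFBS-ID within 5 kb of a TSS) are taken from the text.

```python
from dataclasses import dataclass

MAX_INDEL_LEN = 200  # bp; the text keeps INDELs shorter than 200 bp
TSS_WINDOW = 5_000   # bp; the text keeps TFBS-ID within 5 kb of a TSS


@dataclass(frozen=True)
class Interval:
    """A TFBS on the reference; 0-based, half-open coordinates (assumed)."""
    chrom: str
    start: int
    end: int
    name: str


@dataclass(frozen=True)
class Indel(Interval):
    """Hypothetical record: reference footprint of an INDEL plus its length in bp."""
    length: int = 1


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def tfbs_affected_by_indels(tfbs_sites, indels, max_len=MAX_INDEL_LEN):
    """Map each retained INDEL (a TFBS-ID) to the names of the TFBS it overlaps."""
    hits = {}
    for indel in indels:
        if indel.length >= max_len:  # drop long INDELs hitting many TFBS at once
            continue
        affected = [t.name for t in tfbs_sites if overlaps(indel, t)]
        if affected:
            hits[indel.name] = affected
    return hits


def distance_to_position(iv, pos):
    """Distance in bp from an interval to a single position (0 if inside)."""
    if pos < iv.start:
        return iv.start - pos
    if pos >= iv.end:
        return pos - (iv.end - 1)
    return 0


def is_near_tss(indel, tss_by_chrom, window=TSS_WINDOW):
    """True if any TSS on the INDEL's chromosome lies within `window` bp."""
    return any(
        distance_to_position(indel, tss) <= window
        for tss in tss_by_chrom.get(indel.chrom, [])
    )


# Toy data (invented coordinates, for illustration only).
tfbs = [
    Interval("chr1", 100, 120, "TF_A"),
    Interval("chr1", 110, 130, "TF_B"),  # overlaps TF_A at the sequence level
    Interval("chr1", 500, 520, "TF_C"),
    Interval("chr2", 100, 120, "TF_D"),
]
indels = [
    Indel("chr1", 115, 118, "del1", 3),
    Indel("chr1", 505, 506, "ins1", 1),
    Indel("chr1", 90, 400, "longdel", 310),  # removed by the length filter
    Indel("chr2", 300, 302, "del2", 2),      # overlaps no TFBS
]
tss_by_chrom = {"chr1": [5400]}

hits = tfbs_affected_by_indels(tfbs, indels)
avg_tfbs_per_indel = sum(len(v) for v in hits.values()) / len(hits)
near = [i.name for i in indels if i.name in hits and is_near_tss(i, tss_by_chrom)]

# Same ratio computed from the totals reported in the text.
reported_avg = round(259_864 / 100_182, 2)
```

In the toy data, `del1` affects two overlapping TFBS, which is why the number of affected TFBS exceeds the number of TFBS-ID. The last line applies the same ratio to the totals reported above (259,864 TFBS and 100,182 TFBS-ID) and recovers the stated average of 2.59. The nested loop is quadratic and only suits small inputs; a genome-scale analysis would need sorted intervals or an interval tree.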
Since many reports have also used a window that flanks
the TSS of known genes [16, 17], we also defined a
different window of the same size (5 kb), now encompassing
2.5 kb on each side of a given TSS. For this window,
we found that 9,733 human genes had at least one
TFBS affected (...truncated)