Populational landscape of INDELs affecting transcription factor-binding sites in humans

BMC Genomics, Jul 2015

Background: Differences in gene expression have a significant role in the diversity of phenotypes in humans. Here we integrated human public data from ENCODE, 1000 Genomes and Geuvadis to explore the populational landscape of INDELs affecting transcription factor-binding sites (TFBS). A significant fraction of TFBS close to the transcription start site of known genes is affected by INDELs, with a consequent effect on the expression of the associated gene.

Results: Hundreds of TFBS-affecting INDELs (TFBS-ID) show a differential frequency between human populations, suggesting a role of natural selection in the spread of such variant INDELs. A comparison with a dataset of known human genomic regions under natural selection allowed us to identify several cases of TFBS-ID likely involved in populational adaptations. Ontology analyses on the differential TFBS-ID further indicated several biological processes under natural selection in different populations.

Conclusion: Together, our results strongly suggest that INDELs have an important role in modulating gene expression patterns in humans. The dataset we make available, together with other data reporting variability at both regulatory and coding regions of genes, represents a powerful tool for studies aiming to better understand the evolution of gene regulatory networks in humans.



Ribeiro-dos-Santos et al. BMC Genomics (2015) 16:536
DOI 10.1186/s12864-015-1744-5

RESEARCH ARTICLE (Open Access)

André M. Ribeiro-dos-Santos¹, Vandeclécio L. da Silva¹,², Jorge E.S. de Souza²,³ and Sandro J. de Souza⁴*

* Correspondence: ⁴Brain Institute, UFRN, Av. Nascimento de Castro, 2155 - 59056-450, Natal, RN, Brazil. Full list of author information is available at the end of the article.

Keywords: Transcription factor, Transcription factor-binding site, INDEL, Population genetics

Background

Much has been debated about the evolutionary role of genetic alterations in the regulation of gene expression [1–7]. In this respect, transcription factor binding sites (TFBS) have recently been studied both in humans and in other animals [8–10]. Several genome-wide analyses have identified regions close to genes (usually enriched with TFBS) showing patterns of diversity in accordance with a model of positive selection [1, 10]. In a recent study, Arbiza et al. [1] found that TFBS are under weaker selection than protein-coding regions of genes, although they also observed several instances of adaptation in TFBS. Similarly, Vernot et al. [10] found hundreds of adaptive variants.

Although these studies have shed some light on the evolutionary forces acting on TFBS and other regulatory elements, several issues remain poorly explored or even unexplored. One of them is the role of INDELs (insertions/deletions) as a source of genetic variability among TFBS. Most of the few populational studies in this area are biased towards single nucleotide variants (SNV) [3, 9, 11].

We therefore explored this issue using three types of recently released public data. First, whole-genome sequences of more than a thousand human individuals from the 1000 Genomes Project (TGP) [12] were used to identify polymorphic INDELs. Second, a genome-wide identification of TFBS for 148 transcription factors from the ENCODE (Encyclopedia of DNA Elements) Project [13] was used to generate a catalogue of TFBS in the human genome. Finally, expression data from a subset of individuals from the 1000 Genomes Project [14] were used to evaluate the impact of TFBS-affecting INDELs (TFBS-ID) on the expression of the corresponding gene. Integration of all these data allowed us to show a high frequency of TFBS-ID in the human genome.
Hundreds of TFBS-ID showed a differential frequency in human populations, and ontology analyses of these cases suggested a role of natural selection and population history in their distribution. Based on that, we argue that a TFBS-ID has been selected in Africans by down-regulating APIP (APAF1-interacting protein) and generating a better response to Salmonella infection. A comparative analysis with genomic regions known to be under positive selection [15] revealed that a significant fraction of the TFBS-ID identified by us represent instances of adaptation in human populations.

© 2015 Ribeiro-dos-Santos et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Results and Discussion

Identification of TFBS-ID

Fig. 1 shows a schematic representation of the computational pipeline used in all analyses reported here. To build a catalogue of TFBS-ID, we first indexed all TFBS identified by the ENCODE project in the human reference genome (hg19 version). Data from the 1000 Genomes project on the position of INDELs in the reference genome were then compared to the position of TFBS, and those cases in which an INDEL overlapped with a TFBS were selected. This strategy yielded a total of 259,864 TFBS affected by at least one INDEL. Since a significant fraction of TFBS overlap at the sequence level, the non-redundant number of TFBS-ID in the above set was 100,182 (an average of 2.59 TFBS per INDEL).
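The overlap step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Interval` record, the function names and the toy coordinates are all hypothetical, and the half-open coordinate convention is an assumption (a zero-length insertion would need its own anchoring rule, which is not covered here). The last line only re-derives the reported average from the two counts given in the text.

```python
from collections import namedtuple

# Hypothetical record type (not from the paper); coordinates are assumed
# to be 0-based and half-open, as in BED files.
Interval = namedtuple("Interval", ["chrom", "start", "end", "name"])


def overlaps(a, b):
    """Return True when two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def find_tfbs_id(tfbs_list, indel_list):
    """Return (names of affected TFBS, names of INDELs hitting at least one TFBS).

    Naive all-against-all comparison; fine for a toy example, too slow for
    genome-scale data, where an interval index would be needed.
    """
    affected_tfbs = set()
    tfbs_id = set()
    for indel in indel_list:
        for site in tfbs_list:
            if overlaps(indel, site):
                affected_tfbs.add(site.name)
                tfbs_id.add(indel.name)
    return affected_tfbs, tfbs_id


# Toy data (invented): two overlapping TFBS on chr1, one TFBS on chr2.
tfbs = [
    Interval("chr1", 100, 120, "tfbs_A"),
    Interval("chr1", 110, 130, "tfbs_B"),
    Interval("chr2", 500, 520, "tfbs_C"),
]
indels = [
    Interval("chr1", 115, 117, "indel_1"),  # hits tfbs_A and tfbs_B
    Interval("chr1", 300, 303, "indel_2"),  # hits nothing
]

affected, tfbs_id = find_tfbs_id(tfbs, indels)

# Average number of affected TFBS per INDEL, from the counts in the text.
ratio_reported = round(259864 / 100182, 2)
```

In the toy data a single deletion falls inside two overlapping binding sites, which is the situation that makes the number of affected TFBS larger than the non-redundant number of TFBS-ID.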
Due to the presence of long INDELs affecting many TFBS at once, we decided to limit our analysis to those INDELs shorter than 200 bp, which gave us a total of 99,642 TFBS-ID and 258,686 TFBS. Although the upper limit was set to 200 bp, the final set of 99,642 TFBS-ID is strongly biased towards shorter INDELs. More than 99.8 % of all INDELs were 20 bp or shorter.

Fig. 1 Analysis overview. Schematic representation of the strategy used here to identify and analyse TFBS affected by polymorphic INDELs in human populations

Next, TFBS-ID close (≤5 kb) to the transcription start site (TSS) of known human genes (as defined by the Reference Sequence set) were selected. In total, 7,313 human genes had at least one TFBS affected by a polymorphic INDEL in the 1000 Genomes dataset. This set of 7,313 genes had a total of 38,339 TFBS affected by INDELs and 10,528 TFBS-ID. A complete list of this dataset is available in Additional file 1: Table S1. Since many reports have also used a window that flanks the TSS of known genes [16, 17], we also defined a different window of the same size (5 kb), now encompassing 2.5 kb on each side of a given TSS. For this window, we found that 9,733 human genes had at least one TFBS affected (...truncated)
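The length filter and the two TSS windows described in this section can be written as simple predicates. The sketch below is illustrative only: the function names are hypothetical, and the first window is assumed to mean up to 5 kb upstream of the TSS in a strand-aware sense, since the text says only "close (≤5 kb)". The second window follows the stated 2.5 kb on each side.

```python
MAX_INDEL_LEN = 200  # bp; the text keeps INDELs shorter than 200 bp
WINDOW = 5000        # bp; total window size stated in the text


def keep_indel(length_bp):
    """Keep only INDELs strictly shorter than the 200 bp limit."""
    return length_bp < MAX_INDEL_LEN


def in_upstream_window(pos, tss, strand):
    """ASSUMPTION: first window = up to 5 kb upstream of the TSS.

    Upstream means smaller coordinates on the '+' strand and larger
    coordinates on the '-' strand.
    """
    offset = tss - pos if strand == "+" else pos - tss
    return 0 <= offset <= WINDOW


def in_flanking_window(pos, tss):
    """Second window: 2.5 kb on each side of the TSS (same 5 kb total)."""
    return abs(pos - tss) <= WINDOW // 2
```

A site 4 kb upstream of a TSS passes the first test but not the second, while a site 2 kb downstream passes only the second, which is one way the two windows can yield different gene counts.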


This is a preview of a remote PDF: http://www.biomedcentral.com/content/pdf/s12864-015-1744-5.pdf
Article home page: http://www.biomedcentral.com/1471-2164/16/536

André Ribeiro-dos-Santos, Vandeclécio da Silva, Jorge de Souza, Sandro de Souza. Populational landscape of INDELs affecting transcription factor-binding sites in humans. BMC Genomics, 2015, 16:536. DOI: 10.1186/s12864-015-1744-5