Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics, and epigenetics data

Article PDF: https://academic.oup.com/gigascience/article-pdf/7/3/gix136/24622235/gix136.pdf

GigaScience, 7, 2018, 1–17
doi: 10.1093/gigascience/gix136
Advance Access Publication Date: 16 February 2018
Research

Quan H. Nguyen1,2, Ross L. Tellam1, Marina Naval-Sanchez1, Laercio R. Porto-Neto1, William Barendse3, Antonio Reverter1, Benjamin Hayes4, James Kijas1 and Brian P. Dalrymple1,5,∗

1 CSIRO Agriculture, 306 Carmody Road, St. Lucia, 4067, QLD, Australia
2 Divisions of Genomics of Development and Disease, Institute for Molecular Bioscience, University of Queensland, 306 Carmody Road, St. Lucia, 4067, QLD, Australia
3 School of Veterinary Science, University of Queensland, Veterinary Science Building (8114), Gatton, 4343, QLD, Australia
4 The Queensland Alliance for Agriculture and Food Innovation (QAAFI), University of Queensland, 306 Carmody Road, St Lucia, 4067, QLD, Australia
5 Institute of Agriculture, The University of Western Australia, 35 Stirling Highway, Crawley, Perth, Western Australia, 6009, Australia

∗ Correspondence address. Brian P. Dalrymple, Institute of Agriculture, The University of Western Australia, 35 Stirling Highway, Crawley, Perth, Western Australia, 6009, Australia. Tel: +61 7 3425 3580

Abstract

Genome sequences for hundreds of mammalian species are available, but an understanding of their genomic regulatory regions, which control gene expression, is only beginning. A comprehensive prediction of potential active regulatory regions is necessary to functionally study the roles of the majority of genomic variants in evolution, domestication, and animal production. We developed a computational method to predict regulatory DNA sequences (promoters, enhancers, and transcription factor binding sites) in production animals (cows and pigs) and extended its broad applicability to other mammals.
The method utilizes human regulatory features identified from thousands of tissues, cell lines, and experimental assays to find homologous regions that are conserved in sequences and genome organization and are enriched for regulatory elements in the genome sequences of other mammalian species. Importantly, we developed a filtering strategy, including a machine learning classification method, to utilize a very small number of species-specific experimental datasets available to select for the likely active regulatory regions. The method finds the optimal combination of sensitivity and accuracy to unbiasedly predict regulatory regions in mammalian species. Furthermore, we demonstrated the utility of the predicted regulatory datasets in cattle for prioritizing variants associated with multiple production and climate change adaptation traits and identifying potential genome editing targets.

Keywords: regulatory genomics; mammalian genome; cattle; pigs; enhancers; promoters; transcription factors; SNP; PLAG1; Poll

Received: 8 June 2017; Revised: 7 November 2017; Accepted: 22 December 2017

© The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Results and Discussion

A pipeline for the projection of human genomic features to other mammals

The 4 key elements of the HPRS pipeline (Fig. 1) include (1) selection of suitable regulatory datatypes (biochemical assays) and tissues in humans; (2) mapping the selected features to the target species by utilizing conservation of genome organization and sequence identity to maximize coverage without compromising specificity; (3) first round filtering of the mapped regions to retain high-confidence mapped features, which had strict 1-to-1 forward and reciprocal mapping, and, where human features have multiple mappings to the target genome, keeping only those with high sequence identity; and (4) second round filtering by applying a pipeline to utilize available (often limited in scale and coverage) species-specific data to prioritize regions likely to be functional in the target species.

Optimizing parameters for mapping sequence features across genomes

To identify regions that were likely to be orthologous between genomes, we deployed the liftOver tool [23] and the precomputed alignment files available from UCSC to map regulatory regions in the human genome to the cattle genome based on sequence similarity and genome location. First, we optimized the minMatch mapping threshold of the liftOver tool, which is the minimum proportion of bases to the total length of a region mappable to contiguous aligned segments in the target genome. The minMatch parameter was thoroughly tested with a range from high stringency 0.95 down to 0.1 (Fig. 2). The minMatch

Background

Predicting functional features of the genome beyond protein-coding regions has been the primary focus of the post-genome sequencing era [1, 2]. More than 90% of common genetic variants associated with phenotypic variation of complex traits are located in intergenic and intronic regions that regulate gene expression but do not change protein structure [3–5]. Moreover, SNPs associated with diseases such as autoimmune diseases, multiple sclerosis, Crohn’s disease, rheumatoid arthritis, and type 1 diabetes are strikingly enriched in promoters and enhancers [4, 6, 7].
Annotation of functional regions of the genome that harbour SNPs identified by genome-wide association studies (GWAS) to be significantly associated with variation in phenotype will contribute to the identification of functional SNPs and causative mutations, thereby suggesting genetic targets and markers for numerous applications in human health care and agricultural livestock production [8–11]. However, in mammalian species other than the human and mouse, there are few data available at the genome level for discovery of regulatory elements. The recently established Functional Annotation of ANimal Genomes (FAANG) consortium has begun to address this deficiency in a coordinated fashion [12, 13]. It is expected that core assays identifying regulatory elements for key tissues in a number of production animals will be produced by the FAANG consortium and collaborators. However, the information generated in the foreseeable future for livestock is likely to remain far less comprehensive for coverage of tissues, sampling conditions, and breadth of annotation of regulatory elements compared with the human and mouse. The deficiency in the genome-wide prediction of regulatory elements is far greater for most other mammalian species. We have developed a computational method to utilize thousands of human regulatory datasets to predict regulatory elements in important mammalian species. Transcriptional regulatory DNA elements (RDEs) are defined as genomic regions that (...truncated)