Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics, and epigenetics data
GigaScience, 7, 2018, 1–17
doi: 10.1093/gigascience/gix136
Advance Access Publication Date: 16 February 2018
Research
Quan H. Nguyen1,2, Ross L. Tellam1, Marina Naval-Sanchez1,
Laercio R. Porto-Neto1, William Barendse3, Antonio Reverter1,
Benjamin Hayes4, James Kijas1 and Brian P. Dalrymple1,5,∗
1 CSIRO Agriculture, 306 Carmody Road, St. Lucia, 4067, QLD, Australia, 2 Divisions of Genomics of Development
and Disease, Institute for Molecular Bioscience, University of Queensland, 306 Carmody Road, St. Lucia, 4067,
QLD, Australia, 3 School of Veterinary Science, University of Queensland, Veterinary Science Building (8114),
Gatton, 4343, QLD, Australia, 4 The Queensland Alliance for Agriculture and Food Innovation (QAAFI),
University of Queensland, 306 Carmody Road, St Lucia, 4067, QLD, Australia and 5 Institute of Agriculture, The
University of Western Australia, 35 Stirling Highway, Crawley, Perth, Western Australia, 6009, Australia
∗ Correspondence address. Brian P. Dalrymple, Institute of Agriculture, The University of Western Australia, 35 Stirling Highway, Crawley, Perth, Western
Australia, 6009, Australia. Tel: +61 7 3425 3580; E-mail:
Abstract
Genome sequences for hundreds of mammalian species are available, but an understanding of their genomic regulatory
regions, which control gene expression, is only beginning. A comprehensive prediction of potential active regulatory
regions is necessary to functionally study the roles of the majority of genomic variants in evolution, domestication, and
animal production. We developed a computational method to predict regulatory DNA sequences (promoters, enhancers,
and transcription factor binding sites) in production animals (cows and pigs) and extended its broad applicability to other
mammals. The method utilizes human regulatory features identified from thousands of tissues, cell lines, and
experimental assays to find homologous regions that are conserved in sequences and genome organization and are
enriched for regulatory elements in the genome sequences of other mammalian species. Importantly, we developed a
filtering strategy, including a machine learning classification method, that uses the very small number of species-specific
experimental datasets available to select the regulatory regions likely to be active. The method finds the optimal combination
of sensitivity and accuracy for unbiased prediction of regulatory regions in mammalian species. Furthermore, we demonstrated
the utility of the predicted regulatory datasets in cattle for prioritizing variants associated with multiple production and
climate change adaptation traits and identifying potential genome editing targets.
Keywords: regulatory genomics; mammalian genome; cattle; pigs; enhancers; promoters; transcription factors; SNP;
PLAG1; Poll
Received: 8 June 2017; Revised: 7 November 2017; Accepted: 22 December 2017
© The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium,
provided the original work is properly cited.
Background
Results and Discussion
A pipeline for the projection of human genomic
features to other mammals
The 4 key elements of the HPRS pipeline (Fig. 1) are (1) selection of suitable regulatory data types (biochemical assays) and
tissues in humans; (2) mapping of the selected features to the target species by utilizing conservation of genome organization
and sequence identity to maximize coverage without compromising specificity; (3) first-round filtering of the mapped regions
to retain high-confidence mapped features, i.e., those with strict 1-to-1 forward and reciprocal mapping and, where human
features have multiple mappings to the target genome, only those with high sequence identity; and (4) second-round
filtering that uses the available (often limited in scale and coverage) species-specific data to prioritize regions
likely to be functional in the target species.
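The first-round filtering in step (3) can be illustrated with a minimal Python sketch. It is not the HPRS implementation: the `Mapping` record, the `first_round_filter` function, and the `min_identity` cutoff of 0.8 are hypothetical names and values chosen only for illustration. The sketch assumes each forward mapping already carries the human feature recovered by mapping the target region back to the human genome.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Mapping:
    """One forward mapping of a human feature onto the target genome (hypothetical record)."""
    human_id: str             # identifier of the human regulatory feature
    target_region: str        # mapped region in the target genome, e.g. "chr1:10-20"
    identity: float           # sequence identity of the mapped region, 0..1
    reciprocal_human_id: str  # human feature recovered by mapping the target region back


def first_round_filter(mappings: List[Mapping], min_identity: float = 0.8) -> List[Mapping]:
    """Keep 1-to-1 reciprocal mappings; for multi-mapped features keep high-identity hits.

    min_identity is an assumed cutoff, not a value taken from the paper.
    """
    by_feature: Dict[str, List[Mapping]] = defaultdict(list)
    for m in mappings:
        by_feature[m.human_id].append(m)

    kept: List[Mapping] = []
    for human_id, hits in by_feature.items():
        if len(hits) == 1:
            # Single forward hit: require that the reciprocal mapping returns the same feature.
            if hits[0].reciprocal_human_id == human_id:
                kept.append(hits[0])
        else:
            # Multiple forward hits: retain only those with high sequence identity.
            kept.extend(h for h in hits if h.identity >= min_identity)
    return kept
```

In this sketch a feature with a single forward hit is dropped when its reciprocal mapping points elsewhere, whereas a multi-mapped feature contributes every hit that clears the identity cutoff.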
Optimizing parameters for mapping sequence features
across genomes
To identify regions that were likely to be orthologous between
genomes, we deployed the liftOver tool [23] and the precomputed alignment files available from UCSC to map regulatory
regions in the human genome to the cattle genome based on
sequence similarity and genome location. First, we optimized
the minMatch mapping threshold of the liftOver tool, which is
the minimum proportion of the bases in a region that must be
mappable to contiguous aligned segments in the target genome.
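As an illustration of this definition, the following Python sketch computes the proportion of a region's bases that fall within aligned segments and compares it with a minMatch threshold. It is a simplified model under stated assumptions, not liftOver's own code: the function names `mapped_fraction` and `passes_min_match` and the half-open interval representation are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in source-genome coordinates


def mapped_fraction(region: Interval, aligned_blocks: List[Interval]) -> float:
    """Return the proportion of bases in `region` covered by aligned blocks."""
    start, end = region
    length = end - start
    if length <= 0:
        raise ValueError("region must have positive length")

    covered = 0
    cursor = start  # right-most position already counted, so overlaps are not counted twice
    for block_start, block_end in sorted(aligned_blocks):
        lo = max(block_start, cursor)
        hi = min(block_end, end)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered / length


def passes_min_match(region: Interval, aligned_blocks: List[Interval], min_match: float) -> bool:
    """A region is retained when its mapped proportion reaches the threshold."""
    return mapped_fraction(region, aligned_blocks) >= min_match
```

For example, a 100-base region with 60 aligned bases would be retained at a permissive threshold such as 0.1 but rejected at a stringent threshold such as 0.95.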
The minMatch parameter was thoroughly tested with a range
from high stringency 0.95 down to 0.1 (Fig. 2). The minMatch
Predicting functional features of the genome beyond protein-coding regions has been the primary focus of the post-genome
sequencing era [1, 2]. More than 90% of common genetic variants associated with phenotypic variation of complex traits are
located in intergenic and intronic regions that regulate gene expression but do not change protein structure [3–5]. Moreover,
SNPs associated with diseases such as autoimmune diseases,
multiple sclerosis, Crohn’s disease, rheumatoid arthritis, and
type 1 diabetes are strikingly enriched in promoters and enhancers [4, 6, 7]. Annotation of functional regions of the genome
that harbour SNPs identified by genome-wide association studies (GWAS) to be significantly associated with variation in phenotype will contribute to the identification of functional SNPs
and causative mutations, thereby suggesting genetic targets and
markers for numerous applications in human health care and
agricultural livestock production [8–11].
However, in mammalian species other than the human and
mouse, there are few data available at the genome level for discovery of regulatory elements. The recently established Functional Annotation of ANimal Genomes (FAANG) consortium has
begun to address this deficiency in a coordinated fashion [12,
13]. It is expected that core assays identifying regulatory elements for key tissues in a number of production animals will be
produced by the FAANG consortium and collaborators. However,
the information generated in the foreseeable future for livestock
is likely to remain far less comprehensive in coverage of tissues, sampling conditions, and breadth of annotation of regulatory elements than that for the human and mouse. The deficiency in the genome-wide prediction of regulatory elements
is far greater for most other mammalian species. We have developed a computational method to utilize thousands of human
regulatory datasets to predict regulatory elements in important
mammalian species.
Transcriptional regulatory DNA elements (RDEs) are defined
as genomic regions that (...truncated)