Pancan-meQTL: a database to systematically evaluate the effects of genetic variants on methylation in human cancer
D1066–D1072 Nucleic Acids Research, 2019, Vol. 47, Database issue
doi: 10.1093/nar/gky814
Published online 7 September 2018
Jing Gong1,*, Hao Wan1, Shufang Mei1, Hang Ruan2, Zhao Zhang2, Chunjie Liu3,
An-Yuan Guo3, Lixia Diao4,*, Xiaoping Miao1,* and Leng Han2,*
1 Department of Epidemiology and Biostatistics, Key Laboratory of Environmental Health of Ministry of Education,
School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei
430030, PR China, 2 Department of Biochemistry and Molecular Biology, The University of Texas Health Science
Center at Houston McGovern Medical School, Houston, TX 77030, USA, 3 Department of Bioinformatics and Systems
Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei
430074, PR China and 4 Department of Bioinformatics and Computational Biology, The University of Texas MD
Anderson Cancer Center, Houston, TX 77030, USA
Received July 23, 2018; Revised August 22, 2018; Editorial Decision August 30, 2018; Accepted August 30, 2018
ABSTRACT
DNA methylation is an important epigenetic mechanism for regulating gene expression. Aberrant DNA
methylation has been observed in various human diseases, including cancer. Single-nucleotide
polymorphisms can contribute to tumor initiation, progression and prognosis by influencing DNA
methylation, and DNA methylation quantitative trait loci (meQTL) have been identified in physiological
and pathological contexts. However, no database has been developed to systematically analyze meQTLs
across multiple cancer types. Here, we present Pancan-meQTL, a database to comprehensively provide
meQTLs across 23 cancer types from The Cancer Genome Atlas by integrating genome-wide genotype and
DNA methylation data. In total, we identified 8 028 964 cis-meQTLs and 965 050 trans-meQTLs.
Among these, 23 432 meQTLs are associated with patient overall survival times. Furthermore, we
identified 2 214 458 meQTLs that overlap with known loci identified through genome-wide association
studies. Pancan-meQTL provides a user-friendly web interface (http://bioinfo.life.hust.edu.cn/Pancan-meQTL/)
that is convenient for browsing, searching and downloading data of interest. This database is a valuable
resource for investigating the roles of genetics and epigenetics in cancer.
INTRODUCTION
The interpretation of the function of genomic variants, particularly in non-coding regions, is a major challenge for
the genetic dissection of complex diseases such as cancer
(1). Genome-wide association studies (GWAS) have identified numerous genetic loci that influence the risk of human cancer (2,3), but most of these loci are located in non-coding regions and lack a clear molecular mechanism
linking them to the phenotypic outcome. Previous studies considered a diverse set of functional regions, including
miRNA binding sites, protein modification sites and transcription factor binding sites (4,5). However, the link between variants and epigenetic signals involved in the regulation of key biological processes has been largely overlooked.
As a major epigenetic mechanism that directs gene expression, DNA methylation plays a key role in the regulation of crucial biological and pathological processes (6).
Aberrant DNA methylation is frequently observed in various cancers (7) and represents an attractive biomarker and
therapeutic target (8,9). Increasing evidence indicates that
single-nucleotide polymorphisms (SNPs) contribute to tumor initiation, progression and prognosis by influencing
DNA methylation levels (10,11). Therefore, DNA methylation may be an important molecular-level phenotype that
links genotype to complex disease traits. Building a public data repository of SNPs that significantly affect DNA methylation levels, i.e. methylation quantitative trait loci (meQTLs), is thus essential. Recent
methodological advances allow for genome-wide screening
of meQTLs in different tissues, including blood (12), lung
* To whom correspondence should be addressed. Tel: +86 27 8365 0744; Email:
Correspondence may also be addressed to Xiaoping Miao. Tel: +86 27 8365 0744; Email:
Correspondence may also be addressed to Lixia Diao. Email:
Correspondence may also be addressed to Leng Han. Tel: +1 713 500 6039; Email:
© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
DATA COLLECTION AND PROCESSING
Genotype data collection, imputation and processing
We downloaded genotype data (level 2) from The Cancer Genome Atlas (TCGA) data
portal (https://portal.gdc.cancer.gov/) (Figure 1A). We kept
7735 samples with both genotype data and methylation
data. We then combined colon adenocarcinoma (COAD)
and rectum adenocarcinoma (READ) as colorectal cancer (CRC) (15) and removed cancer types with
fewer than 100 primary tumor samples. Thus, for further analysis,
we had 7242 samples across 23 cancer types. We performed
genotype imputation and filtering per cancer type as described in our previous study (16). After imputation and
quality filtering, on average, 4 318 218 genotypes per cancer type were included in the meQTL analysis.
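The sample-selection step described above (merging COAD and READ into CRC, then dropping cancer types with fewer than 100 samples) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `select_cancer_types` and the sample-to-type dictionary are hypothetical, and the input is assumed to contain only primary tumor samples that already have both genotype and methylation data.

```python
from collections import Counter

MIN_SAMPLES = 100  # threshold stated in the text
MERGED = {"COAD": "CRC", "READ": "CRC"}  # COAD and READ are combined as CRC


def select_cancer_types(sample_to_type, min_samples=MIN_SAMPLES):
    """Relabel COAD/READ as CRC, then keep types with at least min_samples samples.

    sample_to_type is a hypothetical mapping: sample ID -> cancer type code.
    """
    relabeled = {s: MERGED.get(t, t) for s, t in sample_to_type.items()}
    counts = Counter(relabeled.values())
    kept_types = {t for t, n in counts.items() if n >= min_samples}
    return {s: t for s, t in relabeled.items() if t in kept_types}
```

Merging happens before counting, so two cohorts that are each below the threshold can still be retained once combined.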
Methylation data collection and processing
Methylation beta values (level 3) obtained from the TCGA
data portal (https://gdc-portal.nci.nih.gov/) were measured
by the Illumina Infinium HumanMethylation450 BeadChip
array, which contained 485 512 probes for each sample.
Because of the distinct nature of methylation patterns on sex
chromosomes (17), we focused on autosomes. In each cancer type, probes were excluded if they met any of the following criteria: (i)
methylation beta value missing rate > 0.05, (ii) mapping to
multiple locations on the genome (18) or (iii) containing
known SNPs (1000 Genomes Phase 3 (19), MAF > 0.01) at
CpG sites (20,21) (Figure 1B). On average, 369 244 high-quality methylation probes per cancer type were used for
analyses. To minimize the effects of outliers on the regression scores, the values for each probe across samples per
cancer type were transformed into a standard normal distribution based on rank (17,22,23).
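The missing-rate filter and the rank-based transformation described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the text does not specify the rank offset or tie handling, so the sketch assumes the quantile (rank − 0.5)/n and average ranks for ties; all function names and the probe IDs in the usage note are hypothetical. Only criterion (i) is shown, because criteria (ii) and (iii) depend on external probe annotations.

```python
from statistics import NormalDist

MAX_MISSING_RATE = 0.05  # threshold stated in the text


def missing_rate(values):
    """Fraction of samples whose beta value is missing (None)."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)


def rank_inverse_normal(values):
    """Map one probe's values to standard-normal quantiles by rank.

    Assumptions: quantile = (rank - 0.5) / n, ties share their average rank,
    and missing values stay None.
    """
    observed = sorted(
        ((v, i) for i, v in enumerate(values) if v is not None),
        key=lambda pair: pair[0],
    )
    n = len(observed)
    out = [None] * len(values)
    std_normal = NormalDist()
    start = 0
    while start < n:
        end = start
        # Extend the window over all values tied with observed[start].
        while end + 1 < n and observed[end + 1][0] == observed[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # 1-based average rank of the tie group
        z = std_normal.inv_cdf((avg_rank - 0.5) / n)
        for _, idx in observed[start:end + 1]:
            out[idx] = z
        start = end + 1
    return out


def process_probes(beta_by_probe):
    """Drop probes with missing rate > 0.05, then transform the remaining probes."""
    return {
        probe: rank_inverse_normal(values)
        for probe, values in beta_by_probe.items()
        if missing_rate(values) <= MAX_MISSING_RATE
    }
```

For example, `process_probes({"cg_a": [0.9, 0.1, 0.5], "cg_b": [None, 0.2, 0.3]})` keeps only `cg_a`, whose median value maps to 0 and whose extremes map to symmetric quantiles. Because only ranks enter the transformation, a single extreme beta value cannot dominate the downstream regression.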
Covariates
To correct for known and unknown confounders and increase the sensitivity of our analyses, we included several
covariates. The top five principal components calculated by
smartpca in the EIGENSOFT program (24) were included
to control for ethnicity differences. To remove hidden batch
effects and other confounders in the methylation data, we
used PEER software (25) to select the first 15 PEER factors from the (...truncated)