SeqSIMLA: a sequence and phenotype simulation tool for complex disease studies
Ren-Hua Chung
0
Chung-Chin Shih
0
0
Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes
,
Zhunan, Miaoli
,
Taiwan
Results: We developed a simulation tool, SeqSIMLA, which can simulate sequence data with user-specified disease and quantitative trait models. We implemented two disease models, in which the user can flexibly specify the number of disease loci, effect sizes or population attributable risk, disease prevalence, and risk or protective loci. We also implemented a quantitative trait model, in which the user can specify the number of quantitative trait loci (QTL), proportions of variance explained by the QTL, and genetic models. We compiled recombination rates from the HapMap project so that genomic structures similar to the real data can be simulated. Conclusions: SeqSIMLA can efficiently simulate sequence data with disease or quantitative trait models specified by the user. SeqSIMLA will be very useful for evaluating statistical properties for new study designs and new statistical methods using NGS. SeqSIMLA can be downloaded for free at http://seqsimla.sourceforge.net.
-
Background
Computer programs that can simulate genotypes with
phenotypes based on user-specified disease or quantitative trait
models are essential in genetic studies. They can be used
to evaluate statistical power when planning a study design
based on the proposed sample size, the assumed genotypic
relative risks (GRR), and allele frequencies. They are also
useful for evaluating type I error rates for new statistical
association tests and power comparisons between the new
tests and other existing tests. Therefore, many simulation
programs have been developed, mostly aiming to generate
genome-wide association study (GWAS) data with
dichotomous or quantitative traits [1-5].
Next-generation sequencing (NGS) has become a
popular technique for identifying novel rare variants associated
with complex diseases [6]. Statistical association tests that
can account for rare variants have also been developed
rapidly [7-10]. These tests aim to identify multiple rare
causal variants in a group of variants selected by biological
functions, such as exons, genes, and pathways. A common
approach is to pool all the variants in the group to
increase the statistical power for associations. To evaluate
the statistical power for new tests, a simulation tool that
can simulate multiple rare casual variants based on
sequence data is necessary. However, simulation programs
developed for GWAS may not be appropriate for
evaluating statistical properties for NGS studies, because they
were designed to simulate common variants based on
GWAS panels (e.g., Illumina and Affymetrix) or HapMap
project data [11]. Thus, computer software that can
simulate sequence data based on realistic models with
phenotypes becomes important.
To our knowledge, SimRare is the only existing public
software designed specifically to simulate sequence data
with phenotypes [12]. SimRare has three modules,
including a sequence generation module, a module for
phenotype generation based on genotypes, and a module for
evaluating association methods. The forward-time
simulation algorithm [5,13] is used in SimRare to generate
variant data. SimRare focuses on generating unrelated
samples and on evaluating association methods developed
for unrelated samples. As more and more family-based
association studies using NGS are conducted [14-17],
software that can generate sequence data in families will be
very useful for evaluating the properties of family-based
NGS analysis.
We developed the Sequence and phenotype Simulator,
SeqSIMLA, which can simulate sequence data in
unrelated casecontrol or family samples with user-specified
disease or quantitative trait models. SeqSIMLA uses
GENOME [18] as the default sequence generator, which
efficiently generates data using the coalescent model.
SeqSIMLA also accepts a population of sequences
generated by other sequence generators. SeqSIMLA can
simulate multiple causal variants in regions on different
chromosomes, where the recombination rates between
regions are based on the rates estimated from the Hap
Map project [11] or a user-specified fixed rate. We
compared the features between SeqSIMLA and SimRare and
used simulations to demonstrate that SeqSIMLA can
generate data in a reasonable time frame.
Implementation
Sequence generation
GENOME is used as the default tool to simulate a
population of sequences based on the coalescent model.
Alternatively, as other sequence simulators can have their own
unique features, SeqSIMLA also accepts a population of
sequences generated by other programs. GENOME either
accepts different recombination rates among
chromosomal blocks or assumes a fixed rate across the genome.
There is no recombination within each of the
chromosomal blocks. To simulate block structures similar to real
populations, we downloaded the recombination hotspots
across the genome from the HapMap project [11], with
the highest recombination rate in each hotspot region
used as the recombination rate for the center of the
hotspot. Crossovers during meiosis are simulated based
on the recombination rates for the centers of hotspots.
Alternatively, the user can assume that the recombination
rates are uniform across the chromosomes, which is the
default setting in GENOME.
Disease models
We do not have restrictions on the number of disease
loci to be simulated. A logistic function as follows is
used to calculate the penetrance:
where X = (G1,G2,,Gn) is a vector of genotype coding
for n disease variants, B= (1, 2, , n) is a vector of the
conditional log of odds ratios for the associated
genotypes, and determines the disease prevalence K. The
parameter is ln (f0/(1 f0)), which is the log odds of
the penetrance for X with no mutant alleles. The odds
ratio ei represents the increased odds for the disease for
an additional mutant allele at variant i [19]. For the
prevalence model (Model 1), the disease prevalence K is
specified by the user. We iteratively search for in the
range between 20 and 20 and calculate disease
prevalence Ki based on i in iteration i. The value i is
selected for if |Ki K| < , where is small (e.g., 0.001).
Alternatively, the user can specify f0 directly, and uses
the population attributable risk (PAR) to determine the
GRRs for the disease loci (the PAR model or Model 2).
The logistic function can be represented by the function
of f0 and GRR :
1f 0f 0 in1GRRik
P Affected X
f 0 in1GRRik.
where f0 is the baseline penetrance specified by the
user, GRRi is the GRR for the genotype at marker i, PARi
is the population attributable risk, and Ri is the risk
allele frequency for marker i. The sum of PARi for the
disease loci is equal to the overall PAR specified by the
user. The parameter k is coded as the number of mutant
allele counts (0, 1, 2) for an additive model, the
presence/absence of an mutant allele (2/0) for a dominant (...truncated)