SeqSIMLA: a sequence and phenotype simulation tool for complex disease studies (pdf)

Article PDF cannot be displayed. You can download it here:

http://www.biomedcentral.com/content/pdf/1471-2105-14-199.pdf

SeqSIMLA: a sequence and phenotype simulation tool for complex disease studies

Ren-Hua Chung 0 Chung-Chin Shih 0 0 Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes , Zhunan, Miaoli , Taiwan Results: We developed a simulation tool, SeqSIMLA, which can simulate sequence data with user-specified disease and quantitative trait models. We implemented two disease models, in which the user can flexibly specify the number of disease loci, effect sizes or population attributable risk, disease prevalence, and risk or protective loci. We also implemented a quantitative trait model, in which the user can specify the number of quantitative trait loci (QTL), proportions of variance explained by the QTL, and genetic models. We compiled recombination rates from the HapMap project so that genomic structures similar to the real data can be simulated. Conclusions: SeqSIMLA can efficiently simulate sequence data with disease or quantitative trait models specified by the user. SeqSIMLA will be very useful for evaluating statistical properties for new study designs and new statistical methods using NGS. SeqSIMLA can be downloaded for free at http://seqsimla.sourceforge.net. - Background Computer programs that can simulate genotypes with phenotypes based on user-specified disease or quantitative trait models are essential in genetic studies. They can be used to evaluate statistical power when planning a study design based on the proposed sample size, the assumed genotypic relative risks (GRR), and allele frequencies. They are also useful for evaluating type I error rates for new statistical association tests and power comparisons between the new tests and other existing tests. Therefore, many simulation programs have been developed, mostly aiming to generate genome-wide association study (GWAS) data with dichotomous or quantitative traits [1-5]. Next-generation sequencing (NGS) has become a popular technique for identifying novel rare variants associated with complex diseases [6]. Statistical association tests that can account for rare variants have also been developed rapidly [7-10]. These tests aim to identify multiple rare causal variants in a group of variants selected by biological functions, such as exons, genes, and pathways. A common approach is to pool all the variants in the group to increase the statistical power for associations. To evaluate the statistical power for new tests, a simulation tool that can simulate multiple rare casual variants based on sequence data is necessary. However, simulation programs developed for GWAS may not be appropriate for evaluating statistical properties for NGS studies, because they were designed to simulate common variants based on GWAS panels (e.g., Illumina and Affymetrix) or HapMap project data [11]. Thus, computer software that can simulate sequence data based on realistic models with phenotypes becomes important. To our knowledge, SimRare is the only existing public software designed specifically to simulate sequence data with phenotypes [12]. SimRare has three modules, including a sequence generation module, a module for phenotype generation based on genotypes, and a module for evaluating association methods. The forward-time simulation algorithm [5,13] is used in SimRare to generate variant data. SimRare focuses on generating unrelated samples and on evaluating association methods developed for unrelated samples. As more and more family-based association studies using NGS are conducted [14-17], software that can generate sequence data in families will be very useful for evaluating the properties of family-based NGS analysis. We developed the Sequence and phenotype Simulator, SeqSIMLA, which can simulate sequence data in unrelated casecontrol or family samples with user-specified disease or quantitative trait models. SeqSIMLA uses GENOME [18] as the default sequence generator, which efficiently generates data using the coalescent model. SeqSIMLA also accepts a population of sequences generated by other sequence generators. SeqSIMLA can simulate multiple causal variants in regions on different chromosomes, where the recombination rates between regions are based on the rates estimated from the Hap Map project [11] or a user-specified fixed rate. We compared the features between SeqSIMLA and SimRare and used simulations to demonstrate that SeqSIMLA can generate data in a reasonable time frame. Implementation Sequence generation GENOME is used as the default tool to simulate a population of sequences based on the coalescent model. Alternatively, as other sequence simulators can have their own unique features, SeqSIMLA also accepts a population of sequences generated by other programs. GENOME either accepts different recombination rates among chromosomal blocks or assumes a fixed rate across the genome. There is no recombination within each of the chromosomal blocks. To simulate block structures similar to real populations, we downloaded the recombination hotspots across the genome from the HapMap project [11], with the highest recombination rate in each hotspot region used as the recombination rate for the center of the hotspot. Crossovers during meiosis are simulated based on the recombination rates for the centers of hotspots. Alternatively, the user can assume that the recombination rates are uniform across the chromosomes, which is the default setting in GENOME. Disease models We do not have restrictions on the number of disease loci to be simulated. A logistic function as follows is used to calculate the penetrance: where X = (G1,G2,,Gn) is a vector of genotype coding for n disease variants, B= (1, 2, , n) is a vector of the conditional log of odds ratios for the associated genotypes, and determines the disease prevalence K. The parameter is ln (f0/(1 f0)), which is the log odds of the penetrance for X with no mutant alleles. The odds ratio ei represents the increased odds for the disease for an additional mutant allele at variant i [19]. For the prevalence model (Model 1), the disease prevalence K is specified by the user. We iteratively search for in the range between 20 and 20 and calculate disease prevalence Ki based on i in iteration i. The value i is selected for if |Ki K| < , where is small (e.g., 0.001). Alternatively, the user can specify f0 directly, and uses the population attributable risk (PAR) to determine the GRRs for the disease loci (the PAR model or Model 2). The logistic function can be represented by the function of f0 and GRR : 1f 0f 0 in1GRRik P Affected X f 0 in1GRRik. where f0 is the baseline penetrance specified by the user, GRRi is the GRR for the genotype at marker i, PARi is the population attributable risk, and Ri is the risk allele frequency for marker i. The sum of PARi for the disease loci is equal to the overall PAR specified by the user. The parameter k is coded as the number of mutant allele counts (0, 1, 2) for an additive model, the presence/absence of an mutant allele (2/0) for a dominant (...truncated)