Does replication groups scoring reduce false positive rate in SNP interaction discovery?
Marko Toplak (0, 2), Tomaz Curk (0, 2), Janez Demsar (0, 2), Blaz Zupan (0, 1, 2)
0 Faculty of Computer and Information Science, University of Ljubljana, Trzaska 25, SI-1000 Ljubljana, Slovenia
1 Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX 77030, USA
2 Faculty of Computer and Information Science, University of Ljubljana, Trzaska 25, SI-1000 Ljubljana, Slovenia
Background: Computational methods that infer single nucleotide polymorphism (SNP) interactions from phenotype data may uncover new biological mechanisms in non-Mendelian diseases. In practice, however, such analysis faces many problems. Present experimental studies typically use SNP arrays with hundreds of thousands of SNPs but record only hundreds of samples, so candidate SNP pairs inferred by interaction analysis may include a high proportion of false positives. Recently, Gayan et al. (2008) proposed to reduce the number of false positives by combining the results of interaction analysis performed on subsets of the data (replication groups), rather than analyzing the entire data set directly. If it performed as hypothesized, replication groups scoring could improve not only interaction analysis but also any type of feature ranking and selection procedure in systems biology. Because Gayan et al. do not compare their approach to standard interaction analysis techniques, we investigate here whether replication groups indeed reduce the number of reported false positive interactions.
Results: A set of simulated and false interaction-imputed experimental SNP data sets was used to compare the inference of SNP-SNP interactions by means of replication groups to the standard approach, in which the entire data set is used directly to score all candidate SNP pairs. In all our experiments, inference of interactions from the entire data set (i.e., without using replication groups) reported fewer false positives.
Conclusions: Compared with the direct scoring approach, the use of replication groups does not reduce the false positive rate and, depending on the data set, may often perform worse.
Background
Onsets of many common chronic diseases are governed
by genetic factors that do not follow Mendelian or
single gene patterns. Such diseases include
hypertension, diabetes, various cancers, Alzheimer's disease,
heart disease, Parkinson's disease, and others. The genetics
governing the susceptibility to these diseases remains
largely unknown. Their onset may be triggered by
polymorphisms across the genome whose effects do not
simply (linearly) sum up but instead interact in a complex,
non-linear fashion. Such interactions are also referred to
as epistasis [1].
A number of computational methods for detection of
epistasis of single nucleotide polymorphisms (SNPs)
have been proposed [2]. They can be based either on
regression models [3], data mining [4], goodness of fit
tests [5] or information theory [6,7]. These methods
consider data sets that include phenotype observations
(presence or absence of a disease) in several hundred
to several thousand cases and controls, each
characterized by a whole-genome profile consisting of several
hundred thousand SNPs. Synergistic SNPs may in the
extreme case provide no information on the disease by
themselves, so the search for interesting SNP-SNP
interactions needs to consider all candidate pairs. In a study
using SNP chips with a million probes, analysis of
epistasis requires scoring of about 5 × 10^11 hypotheses, one
for each candidate pair. Due to the limited number of
samples, the number of spurious false positive results can be
overwhelming.
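The number of hypotheses quoted above is simply the number of unordered SNP pairs. A quick check, assuming a chip with exactly one million probes (the variable names are ours):

```python
from math import comb

n_probes = 1_000_000          # assumed chip size, as in the text
n_pairs = comb(n_probes, 2)   # unordered SNP pairs: n * (n - 1) / 2

print(f"{n_pairs:.3e} candidate SNP pairs")
```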
To reduce the number of reported false positive
interactions, Gayan et al. (2008) have recently proposed a
scoring approach called Hypothesis Free Clinical
Cloning (HFCC). The part of HFCC used for interaction
scoring is based on so-called replication groups: the
available samples are split into non-overlapping
subsets, and only those SNP interactions whose minimal
interaction score across all subsets exceeds a certain
threshold are reported. The authors hypothesize that this
approach may allow identification of frequent and
consistent epistatic effects, improving the filtering of
false positive results at the expense of lower test power
and thus a higher false negative rate.
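The replication groups idea described above can be sketched in a few lines of Python. This is a minimal illustration and not the HFCC implementation: the interaction score used here (information gain of the SNP pair minus the gains of the individual SNPs), the random split, the number of groups, and all function names are our assumptions.

```python
import math
import random
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy (bits) of a sequence of hashable values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def info_gain(feature, phenotype):
    """Mutual information I(feature; phenotype) in bits."""
    joint = list(zip(feature, phenotype))
    return entropy(feature) + entropy(phenotype) - entropy(joint)


def interaction_score(snp_a, snp_b, phenotype):
    """Assumed score: I(A,B; C) - I(A; C) - I(B; C)."""
    pair = list(zip(snp_a, snp_b))
    return (info_gain(pair, phenotype)
            - info_gain(snp_a, phenotype)
            - info_gain(snp_b, phenotype))


def split_into_groups(n_samples, n_groups, rng):
    """Shuffle sample indices and deal them into non-overlapping groups."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[g::n_groups] for g in range(n_groups)]


def replication_group_scores(genotypes, phenotype, n_groups=3, seed=0):
    """Score each SNP pair by its minimum interaction score across groups.

    genotypes: list of SNP columns, each a list with one genotype per sample.
    """
    groups = split_into_groups(len(phenotype), n_groups, random.Random(seed))
    scores = {}
    for i, j in combinations(range(len(genotypes)), 2):
        per_group = []
        for g in groups:
            a = [genotypes[i][s] for s in g]
            b = [genotypes[j][s] for s in g]
            c = [phenotype[s] for s in g]
            per_group.append(interaction_score(a, b, c))
        scores[(i, j)] = min(per_group)
    return scores


def report(scores, threshold):
    """Keep only the pairs whose minimum score exceeds the threshold."""
    return [pair for pair, s in scores.items() if s > threshold]
```

The direct approach corresponds to calling `interaction_score` once per pair on all samples; the replication groups variant differs only in taking the minimum over per-group scores before thresholding.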
Gayan et al. demonstrate the utility of HFCC in a
practical application, but do not specifically address
their otherwise intuitive assertion on the reduction of
false positive rate by HFCC. We were curious whether the
use of replication groups indeed performs as suggested.
If so, the approach would not only advance the
field of epistasis analysis, but could also spark new
improvements in techniques for SNP, gene and protein
scoring and ranking, where standard feature selection
procedures face similar problems due to the low
samples-to-features ratio.
We compared SNP interaction scoring with
replication groups to the standard procedure, which uses the
entire data set. We performed experiments on simulated
data and five data sets from Gene Expression Omnibus
(GEO) [8]. We were unable to confirm that the use of
replication groups reduces the number of false positive
results. On the contrary, the standard approach
performed better in all our experiments.
generated according to six two-SNP epistasis models
(see Figure 1). Unlike Ritchie et al. (2003), our data sets
included multiple interactions, but such that each SNP
was involved in interaction with at most one other SNP.
Two different types of data sets with respect to the
number of SNPs were crafted, each comprising 200
control and 200 disease samples:
1. data sets with 100 SNPs (syn1), where each data
set included 24 SNP interactions (four
interactions for each of six epistasis models),
2. data sets with 500 SNPs (syn2), where each data
set included 60 SNP interactions (ten interactions
for each model).
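As an illustration of how such a balanced data set might be simulated, the sketch below draws genotypes and keeps individuals until both classes are full. Several points are our assumptions and not taken from the text: the allele frequency of 0.5, Hardy-Weinberg genotype proportions, the rule that an individual is affected if any interacting pair triggers the disease, and the penetrance table, which follows our reading of the Figure 1 caption for model 1 (10% penetrance for AABb, AaBB, Aabb and aaBb, zero otherwise).

```python
import random

# Assumed penetrance table, keyed by (copies of allele A, copies of allele B).
# Genotypes not listed have penetrance 0.
MODEL_1 = {(2, 1): 0.1, (1, 2): 0.1, (1, 0): 0.1, (0, 1): 0.1}


def sample_genotype(freq, rng):
    """Draw 0, 1 or 2 copies of an allele under Hardy-Weinberg proportions."""
    return (rng.random() < freq) + (rng.random() < freq)


def simulate(n_snps=100, n_pairs=4, n_cases=200, n_controls=200,
             penetrance=MODEL_1, freq=0.5, seed=0):
    """Rejection-sample individuals until both classes are full.

    SNPs (0, 1), (2, 3), ... form the interacting pairs, so each SNP takes
    part in at most one interaction; the remaining SNPs carry no signal.
    """
    rng = random.Random(seed)
    rows, labels = [], []
    need = {1: n_cases, 0: n_controls}
    while need[0] or need[1]:
        g = [sample_genotype(freq, rng) for _ in range(n_snps)]
        # Assumption: any single interacting pair can cause the disease.
        affected = int(any(
            rng.random() < penetrance.get((g[2 * k], g[2 * k + 1]), 0.0)
            for k in range(n_pairs)))
        if need[affected]:
            rows.append(g)
            labels.append(affected)
            need[affected] -= 1
    return rows, labels


rows, labels = simulate()
print(len(rows), sum(labels))  # number of samples, number of cases
```

With a single penetrance table this produces one model's interactions only; a data set mixing six models would need one table per pair.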
Several simulated data sets were subjected to different
types of noise, including missing data (mN), genotyping
error (gN), phenocopies (pN), and genetic heterogeneity
(hN). Noise was introduced according to the methods
described by Ritchie et al. (2003). Throughout this
report, data set names indicate the number of SNPs
(syn1 or syn2) and the type of the noise used (either no
suffix where no noise was applied, noise type where a
single type of noise was applied, or AN where all types
of noise were applied simultaneously).
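Two of these noise types, missing data and genotyping error, are easy to sketch on a genotype matrix coded as 0, 1 or 2. The code below is only an illustration: it does not reproduce the exact procedures of Ritchie et al. (2003), and the noise rates, the `MISSING` marker, and the function names are hypothetical.

```python
import random

MISSING = None  # hypothetical marker for a missing genotype call


def add_missing(rows, rate, rng):
    """Blank out each genotype call independently with probability `rate`."""
    return [[MISSING if rng.random() < rate else g for g in row]
            for row in rows]


def add_genotyping_error(rows, rate, rng):
    """With probability `rate`, replace a call by a different genotype."""
    def corrupt(g):
        return rng.choice([x for x in (0, 1, 2) if x != g])
    return [[corrupt(g) if rng.random() < rate else g for g in row]
            for row in rows]


rng = random.Random(0)
clean = [[0, 1, 2] * 10 for _ in range(50)]     # toy genotype matrix
noisy = add_genotyping_error(add_missing(clean, 0.05, rng), 0.05, rng)
```

Note that in this composition a missing call hit by the error step is turned back into a genotype; applying the two steps in the opposite order avoids that. Phenocopies and genetic heterogeneity act on the phenotype and the disease model and are not covered here.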
SNP data from Gene Expression Omnibus
We searched Gene Expression Omnibus [8] for SNP
data sets that contain at least 200 samples with an
approximately equal case/control distribution. Five data sets
met these criteria:
GSE6754 [9] describing families with two
individuals affected by autism spectrum disorders.
Figure 1 Disease penetrance models. Penetrance models used to simulate epistasis between two SNPs. Allele frequencies are denoted with p
and q. For example, model 1 specifies that 10% of individuals with genotypes AABb, AaBB, Aabb or aaBb and none of individuals with other
ge (...truncated)