"Does replication groups scoring reduce false positive rate in SNP interaction discovery?: Response"
BMC Genomics
Javier Gayán 0
Antonio González-Pérez 0
Agustín Ruiz 0
0 Neocodex, Avda. Charles Darwin 6, 41092 Sevilla, Spain
A response to Toplak et al: Does replication groups scoring reduce false positive rate in SNP interaction discovery? BMC Genomics 2010, 11:58.

Background: The genomewide evaluation of genetic epistasis is a computationally demanding task, and a current challenge in Genetics. HFCC (Hypothesis-Free Clinical Cloning) is one of the methods that have been suggested for genomewide epistasis analysis. In order to perform an exhaustive search of epistasis, HFCC has implemented several tools and data filters, such as the use of multiple replication groups, and direction of effect and control filters. A recent article has claimed that the use of multiple replication groups (as implemented in HFCC) does not reduce the false positive rate, and we hereby try to clarify these issues.

Results/Discussion: HFCC uses, as an analysis strategy, the possibility of replicating findings in multiple replication groups, in order to select a liberal subset of preliminary results that are above a statistical criterion and consistent in direction of effect. We show that the use of replication groups and the direction filter reduces the false positive rate of a study, although at the expense of lowering the overall power of the study. A post-hoc analysis of these selected signals in the combined sample could then be performed to select the most promising results.

Conclusion: Replication of results in independent samples is generally used in scientific studies to establish credibility in a finding. Nonetheless, the combined analysis of several datasets is known to be a preferable and more powerful strategy for the selection of top signals. HFCC is a flexible and complete analysis tool, and one of its analysis options combines these two strategies: A preliminary multiple replication group analysis to eliminate inconsistent false positive results, and a post-hoc combined-group analysis to select the top signals.
Background
Epistasis, the interaction among genetic loci, is a frequent
phenomenon in nature [1]. However, the detection of
epistatic effects in observational data has not been an easy
task because of the lack of appropriate samples and
methodologies [2,3]. Thanks to the recent collection of large
genetic datasets, we are now in a position where the study
of epistasis in humans is becoming possible. Nonetheless,
the genomewide evaluation of genetic epistasis is a
computationally and statistically demanding task, due to the
large number of possible combinations of loci that can be
formed. For example, for a genomewide analysis with
100,000 SNPs, there are 5 × 10^9 two-locus combinations,
and 1.7 × 10^14 three-locus combinations. For 1 million
SNPs, there are 5 × 10^11 two-locus and 1.7 × 10^17
three-locus combinations.
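The counts quoted above are binomial coefficients, that is, the number of unordered sets of markers of a given size. A minimal check using only the Python standard library is sketched below; the function name `n_combinations` is ours and is used for illustration only.

```python
import math


def n_combinations(n_snps: int, order: int) -> int:
    """Number of distinct unordered marker combinations of the given order."""
    return math.comb(n_snps, order)


# Two- and three-locus search spaces for the two panel sizes discussed in the text.
for n_snps in (100_000, 1_000_000):
    for order in (2, 3):
        count = n_combinations(n_snps, order)
        print(f"{n_snps:>9,} SNPs, {order}-locus combinations: {count:.1e}")
```

The search space grows roughly as the number of SNPs raised to the order of the combination, which is why each additional locus multiplies the testing burden by several orders of magnitude.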
The exhaustive search for epistasis across this large
data space is a challenge for today's gene hunters. In this
context, a variety of software has been released to tackle
this issue (see [4] for a review). HFCC (Hypothesis-Free
Clinical Cloning) [4] is one of the tools that have made
genomewide epistasis analysis possible. This software uses
case-control samples to test for single-locus or
multi-locus genetic association. Multi-locus combinations that
are significantly associated with a trait are then subjected
to a variety of post-hoc tests to determine the degree of
non-additivity of the marker combination, that is, to
separate additive multi-marker combinations from more
epistatic interactions. Those genetic effects that are due to
epistatic interactions are one of the priorities in our
analysis, because they complement those effects detectable by
single-locus analysis.
Because HFCC performs an exhaustive search of the
entire data space, several optional tools have been
implemented to address the resulting multiple testing problem,
such as multiple replication groups, the direction filter, the
control filter, and the tracking filter. For example, the
case-control sample can be analyzed simultaneously in
several replication groups, so that only results that are
significant in every group are selected. There is also a
complementary direction filter, which selects only those
results that are consistent across groups, that is, significant
and with the same direction of effect in every group.
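The logic of these two filters can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the HFCC implementation: the names `GroupResult`, `passes_replication`, `passes_direction` and `select`, the use of a signed effect such as a log odds ratio, and the threshold of 0.05 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GroupResult:
    """Association result for one marker combination in one replication group."""
    p_value: float  # association p-value in this group
    effect: float   # signed effect (assumed here: log odds ratio)


def passes_replication(groups: Sequence[GroupResult], alpha: float) -> bool:
    """True if the result is significant at `alpha` in every group."""
    return all(g.p_value <= alpha for g in groups)


def passes_direction(groups: Sequence[GroupResult]) -> bool:
    """True if the effect has the same sign in every group."""
    return all(g.effect > 0 for g in groups) or all(g.effect < 0 for g in groups)


def select(groups: Sequence[GroupResult], alpha: float = 0.05) -> bool:
    """Keep a result only if it passes both the replication and direction filters."""
    return passes_replication(groups, alpha) and passes_direction(groups)


# Illustrative (made-up) results for three marker combinations in two groups.
consistent = [GroupResult(0.01, 0.4), GroupResult(0.03, 0.2)]
flipped = [GroupResult(0.01, 0.4), GroupResult(0.03, -0.2)]
weak = [GroupResult(0.01, 0.4), GroupResult(0.20, 0.3)]

print(select(consistent), select(flipped), select(weak))
```

Only the first combination is retained: the second is significant in both groups but with opposite directions of effect, and the third is not significant in the second group.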
A recent article by Toplak et al. [5] has been inspired by
the following statement in HFCC's article [[4], page 3]: "...
a multi-group analysis strategy ... allows the replication of
consistent results, and it also aids the elimination of false
positive results, a very attractive quality for genome-wide
analysis of large number of genetic markers." These
authors have interpreted the above statement as claiming
that using replication groups, by itself, reduces "the false
positive rate" [[5], page 2], and can therefore "... improve
... any type of feature ranking and selection procedure ..."
[[5], page 1].
Our approach to detecting multi-locus effects uses a
two-stage analysis strategy. In the first step, a large subset of
preliminary results that are associated with the disease
is selected. Then, this liberal subset of results is
subjected to a post-hoc analysis to select the most promising
results [[4], page 7]. Using replication groups is only one
of the possible analysis strategies of HFCC. Its aim is to
reduce the number of selected signals: it eliminates a
larger part of the tail of the distribution of the
results, which consists mostly of false positives, together with
some true effects that are indistinguishable from
unassociated variants [10].
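The reduction in the number of selected signals can be illustrated with a back-of-the-envelope calculation for unassociated marker combinations. The sketch below rests on simplifying assumptions that are ours and are not taken from HFCC: the replication groups are independent, each group applies a two-sided test at the same level `alpha`, and a null effect is equally likely to point in either direction. Under these assumptions an unassociated combination passes all `k` groups with probability `alpha**k`, and passes with a consistent direction with probability `2 * (alpha / 2)**k`. The function names and the value of `alpha` are hypothetical.

```python
def null_pass_probability(alpha: float, n_groups: int,
                          direction_filter: bool = True) -> float:
    """Probability that one unassociated combination is selected.

    Assumes independent groups, a two-sided test at level `alpha` in each
    group, and a null effect equally likely to point in either direction.
    """
    p_all_significant = alpha ** n_groups
    if not direction_filter:
        return p_all_significant
    # Same sign in every group: 2 * (alpha / 2) ** n_groups
    return p_all_significant / 2 ** (n_groups - 1)


def expected_null_signals(n_tests: int, alpha: float, n_groups: int,
                          direction_filter: bool = True) -> float:
    """Expected number of unassociated combinations that are selected."""
    return n_tests * null_pass_probability(alpha, n_groups, direction_filter)


n_pairs = 4_999_950_000  # two-locus combinations of 100,000 SNPs
alpha = 0.01             # illustrative liberal threshold (assumption)
for k in (1, 2, 3):
    without = expected_null_signals(n_pairs, alpha, k, direction_filter=False)
    with_dir = expected_null_signals(n_pairs, alpha, k, direction_filter=True)
    print(f"{k} group(s): {without:.2e} without direction filter, "
          f"{with_dir:.2e} with direction filter")
```

This calculation covers only unassociated combinations. It does not model the loss of power that results from testing each group separately, which is the price paid for the smaller number of selected signals.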
Our original statement claims that using multiple
replication groups should reduce the number of signals
selected with a liberal statistical threshold (mostly false
positives); it does not claim that this strategy should be
used to select the top results of the study. Indeed, to select
the top findings, we analyze the combined sample [[4], page 6],
which, as we state repeatedly throughout our article [[4], pages
2,3,4], is known to be the most powerful analysis strategy
[6,7].
It therefore seems that Toplak et al. interpreted our article
incorrectly, and applied this misinterpretation to test
their own hypothesis (that replication groups aid the
prioritization of signals), which was ultimately rejected by their
simulations. In their paper, these authors provide evidence
that the analysis of a combined sample is less prone to
false positives than the separate analysis of replication
samples. However, it is not clear from their article
whether they have selected signals consistent in strength
and direction, as suggested in our paper and in the
guidelines for replication of association results [8], which may
compromise to some exte (...truncated)