Principal component analysis-based filtering improves detection for Affymetrix gene expression arrays
Jun Lu (1,2), Robnet T. Kerns (1,2), Shyamal D. Peddada (0), Pierre R. Bushel (0,2)
0 Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA
1 SRA International, Inc
2 Microarray and Genome Informatics Group, National Institute of Environmental Health Sciences
Gene expression array technology has reached the stage of being routinely used to study clinical samples in search of diagnostic and prognostic biomarkers. Due to the nature of array experiments, which examine the expression of tens of thousands of genes simultaneously, the number of null hypotheses is large. Hence, multiple testing correction is often necessary to control the number of false positives. However, multiple testing correction can lead to low statistical power in detecting genes that are truly differentially expressed. Filtering out non-informative genes allows for reduction in the number of null hypotheses. While several filtering methods have been suggested, the appropriate way to perform filtering is still debatable. We propose a new filtering strategy for Affymetrix GeneChips, based on principal component analysis of probe-level gene expression data. Using a wholly defined spike-in data set and one from a diabetes study, we show that filtering by the proportion of variation accounted for by the first principal component (PVAC) provides increased sensitivity in detecting truly differentially expressed genes while controlling false discoveries. We demonstrate that PVAC exhibits equal or better performance than several widely used filtering methods. Furthermore, a data-driven approach that guides the selection of the filtering threshold value is also proposed.
Microarrays are routinely used to simultaneously examine
the expression of thousands or tens of thousands of genes
in various tissues and species (1). In recent years, there has
been an increase in the use of array technology to
study clinical samples in search of biomarkers and gene
expression signatures for improved diagnosis and
prognosis (2-5). Hence, the quality and the reproducibility of
the data become critically important (6,7).
One of the main applications of microarrays is to
identify differentially expressed genes (DEGs) between
two or more groups of biological samples. DEGs are
identified through statistical testing on a gene by gene
level. Given the nature of the array experiments where
tens of thousands of genes (or probe sets) are printed on
an array, the number of null hypotheses to be tested is
large. Hence, multiple testing correction is often necessary
in order to control for the number of false positives.
One of the commonly used methods for multiple testing
control is the false discovery rate (FDR) (8), which is the
expected proportion of false rejections among all
rejections. While FDR adjustment of raw
P-values is effective in controlling false positives, it is
associated with reduced power to detect true DEGs.
In a typical experiment, the percentage of true positives
among all the genes present on an array is often
low (usually <10%). Detecting such a small percentage
of DEGs with enough statistical power is clearly
challenging.
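As a concrete illustration of FDR adjustment, the following minimal sketch implements the Benjamini-Hochberg step-up adjustment, one common FDR-controlling procedure. It is assumed here for illustration only; the text above does not state which FDR procedure is used. The function name `benjamini_hochberg` and the P-values in `raw` are hypothetical.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Illustrative sketch only; the name and interface are assumptions.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


# Made-up raw p-values, for illustration only.
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(raw)
n_rejected = int(np.sum(adj <= 0.05))
```

In this toy example, four raw P-values fall below 0.05 but only two remain below 0.05 after adjustment, which illustrates the loss of power described above.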
One strategy to tackle the issue of low power is to
reduce the number of null hypotheses by first filtering
out non-informative genes and then performing hypothesis
testing only on the genes that pass the filter (i.e. the
so-called two-stage approach) (9,10). Filtering is
motivated by the fact that most whole-genome arrays
are designed to be used to detect changes in expression
levels in all tissue types and treatment conditions.
However, it is well-known that, under a given condition,
many genes on an array are not expressed, expressed at
low levels, or expressed at levels with no biological
significance. In fact, it has been estimated that in a given tissue
only 30-40% of the genes are expressed at array detectable
levels (11). From a recent study using deep sequencing
technology on multiple tissues and at the low threshold of
0.3 reads per kilobase exon model per million mapped
reads (RPKM), the number of genes expressed in human
and mouse tissues is estimated to be 60-70% of RefSeq
coding genes (12). Given that the sensitivity of array
platforms is generally considered lower than that of deep sequencing
(with enough sequence depth), clearly a significant
percentage of genes are either not expressed or below the
detection limit in a typical array experiment. Filtering out
this group of genes would potentially be beneficial to DEG
detection. Furthermore, it has been shown that probe set
filtering increases concordance between Affymetrix and
quantitative reverse transcription-PCR (qRT-PCR)
expression measurements (13).
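The effect of reducing the number of hypotheses can be sketched with a small synthetic example. It assumes the Benjamini-Hochberg step-up rule and an idealized filter that removes only true nulls, which real filters do not guarantee. All P-values below are synthetic, and the function name `bh_rejections` is hypothetical.

```python
import numpy as np


def bh_rejections(pvalues, alpha=0.05):
    """Count hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    m = p.size
    thresholds = alpha * np.arange(1, m + 1) / m
    passing = np.nonzero(p <= thresholds)[0]
    # Reject everything up to the largest rank that meets its threshold.
    return 0 if passing.size == 0 else int(passing[-1] + 1)


# Synthetic p-values: 5 genes with small p-values and 995 null-like genes.
signal = np.array([0.0001, 0.0004, 0.0009, 0.0015, 0.002])
nulls = np.linspace(0.05, 1.0, 995)

unfiltered = np.concatenate([signal, nulls])            # 1000 hypotheses
# Idealized filter (assumption): drops 900 nulls and keeps every signal gene.
filtered = np.concatenate([signal, nulls[::10][:95]])   # 100 hypotheses

n_unfiltered = bh_rejections(unfiltered)
n_filtered = bh_rejections(filtered)
```

With these synthetic values, none of the five genes is rejected among 1000 hypotheses, whereas all five are rejected once the list is reduced to 100 hypotheses, because each rank's rejection threshold scales inversely with the number of tests.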
There are a number of filtering methods available in the
literature (9,10,14-16). The most commonly used filter
statistics include the fraction of Present calls for
Affymetrix arrays, the overall mean and the overall
variance. Note that these statistics are calculated across all
samples (i.e. arrays) by ignoring the sample class labels.
Therefore, these approaches are also called non-specific
filters. It has been suggested that the non-specific filters
should be preferred as they do not interfere with
downstream statistical analyses (10,17). Based on several real
and simulated data sets, Hackstadt and Hess (9)
concluded that the variance filter is superior to the mean
filter. Similarly, using a leukemia data set, Bourgon et al.
(10) showed that the mean filter generally produced fewer
rejections than the variance filter. Due to the
subjective nature of filtering, comparing different methods
can be difficult and additional comparisons using different
control data sets are warranted. Moreover, questions still
remain on how to select the threshold in filtering and
whether further improvements can be made.
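A non-specific filter of the kind described above can be sketched as follows. The statistic is computed per gene across all arrays, without reference to class labels. The function name `nonspecific_filter`, the `keep_fraction` parameter and the toy expression values are assumptions for illustration; the choice of threshold is precisely the open question noted above.

```python
import numpy as np


def nonspecific_filter(expr, statistic="variance", keep_fraction=0.5):
    """Boolean mask of genes to keep, ignoring sample class labels.

    expr: genes x samples matrix of (log-scale) expression values.
    keep_fraction: hypothetical tuning parameter, not a recommended value.
    """
    expr = np.asarray(expr, dtype=float)
    if statistic == "variance":
        score = expr.var(axis=1, ddof=1)   # overall variance per gene
    elif statistic == "mean":
        score = expr.mean(axis=1)          # overall mean per gene
    else:
        raise ValueError("statistic must be 'variance' or 'mean'")
    cutoff = np.quantile(score, 1.0 - keep_fraction)
    return score >= cutoff


# Toy matrix: 4 genes x 4 arrays (made-up values).
expr = [
    [5.0, 5.0, 5.0, 5.0],   # constant gene
    [2.0, 9.0, 2.0, 9.0],   # highly variable gene
    [8.0, 8.1, 7.9, 8.0],   # high mean, low variance
    [1.0, 1.2, 0.9, 1.1],   # low mean, low variance
]
keep_by_variance = nonspecific_filter(expr, "variance")
keep_by_mean = nonspecific_filter(expr, "mean")
```

In this toy example the two filters retain different genes, which is one reason comparisons between filter statistics depend on the data set used.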
On the Affymetrix platform, one uses a probe set
containing multiple 25-bp oligonucleotide probes to
represent a gene. For this type of array, Talloen et al. (16)
recently introduced a filtering technique named
informative/non-informative calls (I/NI-calls). This method was
derived from the summarization algorithm, factor analysis
for robust microarray summarization (FARMS) (18).
It entails the utilization of Bayesian factor analysis on
probe level data and filtering out the genes by the
variance of a factor. One attractive feature of their
method is that the variance of the factor in their model
captures the correlation between probes. As all probes
in a probe set are designed to target the same transcript or
a transcript cluster (19), these probes should largely
perform concordantly when gene expression is measured.
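One way to quantify this concordance is the proportion of variation accounted for by the first principal component (PVAC) of a probe set's probe-level data, as defined in the abstract. The sketch below is not the authors' implementation: the samples-by-probes layout, the per-probe centering without scaling, the function name `pvac` and the toy data are all assumptions.

```python
import numpy as np


def pvac(probe_matrix):
    """Share of total variance captured by the first principal component.

    probe_matrix: samples x probes matrix for a single probe set
    (layout and centering-only preprocessing are assumptions).
    """
    x = np.asarray(probe_matrix, dtype=float)
    centered = x - x.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the PC variances.
    s = np.linalg.svd(centered, compute_uv=False)
    total = float(np.sum(s ** 2))
    return 0.0 if total == 0.0 else float(s[0] ** 2 / total)


# Toy probe set: 6 samples, 11 probes (made-up values).
signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
affinities = np.linspace(0.5, 1.5, 11)
offsets = np.linspace(6.0, 8.0, 11)

# Concordant probes: every probe tracks the same underlying signal.
concordant = np.outer(signal, affinities) + offsets
# Discordant probes: independent noise with no shared signal.
rng = np.random.default_rng(0)
discordant = rng.normal(size=(6, 11))

pvac_concordant = pvac(concordant)
pvac_discordant = pvac(discordant)
```

When all probes follow the same expression pattern, the first component captures nearly all of the variation; when probes vary independently, the variation is spread over several components and the score is lower.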
In this report, we propose a new strategy to filter
non-informative features based on gene expression from
Affymetrix arrays. We explore the correlation feature
between probes by conducting principal component
analysis (PCA) on the probe-level data, and use the
variability captured by t (...truncated)