Principal component analysis-based filtering improves detection for Affymetrix gene expression arrays

https://nar.oxfordjournals.org/content/39/13/e86.full.pdf

Jun Lu (1,2), Robnet T. Kerns (1,2), Shyamal D. Peddada (0), Pierre R. Bushel (0,2)

(0) Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA
(1) SRA International, Inc.
(2) Microarray and Genome Informatics Group, National Institute of Environmental Health Sciences

Gene expression array technology has reached the stage of being routinely used to study clinical samples in search of diagnostic and prognostic biomarkers. Due to the nature of array experiments, which examine the expression of tens of thousands of genes simultaneously, the number of null hypotheses is large. Hence, multiple testing correction is often necessary to control the number of false positives. However, multiple testing correction can lead to low statistical power in detecting genes that are truly differentially expressed. Filtering out non-informative genes allows for reduction in the number of null hypotheses. While several filtering methods have been suggested, the appropriate way to perform filtering is still debatable. We propose a new filtering strategy for Affymetrix GeneChips, based on principal component analysis of probe-level gene expression data. Using a wholly defined spike-in data set and one from a diabetes study, we show that filtering by the proportion of variation accounted for by the first principal component (PVAC) provides increased sensitivity in detecting truly differentially expressed genes while controlling false discoveries. We demonstrate that PVAC exhibits equal or better performance than several widely used filtering methods. Furthermore, a data-driven approach that guides the selection of the filtering threshold value is also proposed.

Microarrays are routinely used to simultaneously examine the expression of thousands or tens of thousands of genes in various tissues and species (1).
In recent years, there has been an increase in the use of array technology to study clinical samples in search of biomarkers and gene expression signatures for improved diagnosis and prognosis (2–5). Hence, the quality and the reproducibility of the data become critically important (6,7). One of the main applications of microarrays is to identify differentially expressed genes (DEGs) between two or more groups of biological samples. DEGs are identified through statistical testing on a gene-by-gene level. Given the nature of array experiments, where tens of thousands of genes (or probe sets) are printed on an array, the number of null hypotheses to be tested is large. Hence, multiple testing correction is often necessary in order to control the number of false positives. One of the commonly used methods for multiple testing control is the false discovery rate (FDR) (8), which is the expected proportion of false rejections among all rejections. While FDR adjustment of raw P-values is effective in controlling false positives, it is associated with reduced power to detect true DEGs. In a typical experiment, the percentage of true positives among all the genes present on an array is often low (usually <10%). Detecting such a small percentage of DEGs with enough statistical power is clearly challenging.

One strategy to tackle the issue of low power is to reduce the number of null hypotheses by first filtering out non-informative genes and then performing hypothesis testing only on the genes that pass the filter (i.e. the so-called two-stage approach) (9,10). Filtering is motivated by the fact that most whole-genome arrays are designed to detect changes in expression levels in all tissue types and treatment conditions. However, it is well known that, under a given condition, many genes on an array are not expressed, expressed at low levels, or expressed at levels with no biological significance.
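The effect of the two-stage approach on FDR adjustment can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg step-up procedure as the FDR method; the function names `bh_adjust` and `two_stage`, the toy P-values and the filter mask are hypothetical and chosen only for illustration, not taken from the study.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values (step-up procedure)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted P-value by m / rank.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest P-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

def two_stage(pvalues, keep_mask, alpha=0.05):
    """Stage 1: keep genes passing a filter. Stage 2: BH on survivors only."""
    p = np.asarray(pvalues, dtype=float)
    keep = np.asarray(keep_mask, dtype=bool)
    rejected = np.zeros(p.size, dtype=bool)
    if keep.any():
        rejected[keep] = bh_adjust(p[keep]) <= alpha
    return rejected

# Toy raw P-values for 10 genes (illustrative only).
raw_p = np.array([0.001, 0.012, 0.02, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95])
# Hypothetical filter result: only the first five genes pass.
keep = np.array([True] * 5 + [False] * 5)

print(int((bh_adjust(raw_p) <= 0.05).sum()))  # 1 rejection without filtering
print(int(two_stage(raw_p, keep).sum()))      # 3 rejections after filtering
```

In this toy case the same raw P-values yield more rejections once the number of tested hypotheses is halved. The gain is only meaningful if the filter itself does not bias the downstream test, which is why the choice of filter statistic matters.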
In fact, it has been estimated that in a given tissue only 30–40% of the genes are expressed at array-detectable levels (11). From a recent study using deep sequencing technology on multiple tissues and at the low threshold of 0.3 reads per kilobase exon model per million mapped reads (RPKM), the number of genes expressed in human and mouse tissues is estimated to be 60–70% of RefSeq coding genes (12). Given that the sensitivity of array platforms is generally considered lower than that of deep sequencing (with enough sequence depth), clearly a significant percentage of genes are either not expressed or beyond the detection limit in a typical array experiment. Filtering out this group of genes would potentially be beneficial to DEG detection. Furthermore, it has been shown that probe set filtering increases concordance between Affymetrix and quantitative reverse transcription-PCR (qRT-PCR) expression measurements (13).

There are a number of filtering methods available in the literature (9,10,14–16). The most commonly used filter statistics include the fraction of Present calls for Affymetrix arrays, the overall mean and the overall variance. Note that these statistics are calculated across all samples (i.e. arrays), ignoring the sample class labels. Therefore, these approaches are also called non-specific filters. It has been suggested that non-specific filters should be preferred as they do not interfere with downstream statistical analyses (10,17). Based on several real and simulated data sets, Hackstadt and Hess (9) concluded that the variance filter is superior to the mean filter. Similarly, using a leukemia data set, Bourgon et al. (10) showed that the mean filter generally produced fewer rejections than the variance filter. Due to the subjective nature of filtering, comparing different methods can be difficult, and additional comparisons using different control data sets are warranted.
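The non-specific filter statistics named above can be sketched as follows. This is a minimal illustration, assuming a genes-by-samples matrix of log-scale expression values and detection calls encoded as the strings "P", "M" and "A"; the function names, the `keep_fraction` parameter and the toy data are hypothetical and are not taken from the cited methods.

```python
import numpy as np

def present_fraction(calls):
    """Fraction of Present ("P") calls per gene, across all arrays."""
    return (np.asarray(calls) == "P").mean(axis=1)

def nonspecific_filter(expr, statistic="variance", keep_fraction=0.5):
    """Boolean mask of genes to keep, ranked by an overall statistic.

    The statistic is computed across all samples; class labels are
    never used, which is what makes the filter non-specific.
    """
    expr = np.asarray(expr, dtype=float)
    if statistic == "variance":
        score = expr.var(axis=1, ddof=1)
    elif statistic == "mean":
        score = expr.mean(axis=1)
    else:
        raise ValueError("statistic must be 'variance' or 'mean'")
    n_keep = max(1, int(round(keep_fraction * expr.shape[0])))
    top = np.argsort(score)[::-1][:n_keep]  # indices of the highest scores
    mask = np.zeros(expr.shape[0], dtype=bool)
    mask[top] = True
    return mask

# Toy data: 4 genes x 4 arrays (illustrative only).
expr = np.array([
    [6.0, 6.0, 6.0, 6.0],   # moderately expressed, no variation
    [1.0, 9.0, 1.0, 9.0],   # highly variable
    [8.0, 8.1, 8.0, 8.1],   # highly expressed, nearly constant
    [2.0, 3.0, 2.0, 3.0],   # low expression, some variation
])
calls = np.array([
    ["P", "P", "P", "P"],
    ["A", "P", "A", "P"],
    ["P", "P", "P", "P"],
    ["A", "A", "M", "A"],
])

print(nonspecific_filter(expr, "variance"))  # keeps genes 1 and 3
print(nonspecific_filter(expr, "mean"))      # keeps genes 0 and 2
print(present_fraction(calls))               # per-gene Present fraction
```

On this toy matrix the variance and mean filters retain disjoint gene sets, which illustrates why conclusions about one filter need not carry over to another.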
Moreover, questions still remain on how to select the threshold in filtering and whether further improvements can be made.

On the Affymetrix platform, one uses a probe set containing multiple 25-bp oligonucleotide probes to represent a gene. For this type of array, Talloen et al. (16) recently introduced a filtering technique named informative/non-informative calls (I/NI-calls). This method was derived from the summarization algorithm factor analysis for robust microarray summarization (FARMS) (18). It applies Bayesian factor analysis to probe-level data and filters genes by the variance of a factor. A useful feature of this method is that, in its model, the variance of the factor can capture the correlation between probes. As all probes in a probe set are designed to target the same transcript or transcript cluster (19), these probes should largely perform concordantly when gene expression is measured.

In this report, we propose a new strategy to filter non-informative features based on gene expression from Affymetrix arrays. We explore the correlation between probes by conducting principal component analysis (PCA) on the probe-level data, and use the variability captured by t (...truncated)