Meta-analysis of gene expression microarrays with missing replicates

BMC Bioinformatics, Mar 2011

Background: Many different microarray experiments are publicly available today. It is natural to ask whether different experiments for the same phenotypic conditions can be combined using meta-analysis, in order to increase the overall sample size. However, some genes are not measured in all experiments, so in traditional meta-analysis they either cannot be included or their statistical significance cannot be appropriately estimated. Nonetheless, these genes, which we refer to as incomplete genes, may also be informative and useful.

Results: We propose a meta-analysis framework, called "Incomplete Gene Meta-analysis", which can include incomplete genes by imputing the significance of missing replicates and computing a meta-score for every gene across all datasets. Using two groups of experiments, we demonstrate that the incomplete genes are worth including and that our method is able to appropriately estimate their significance. We first apply the Incomplete Gene Meta-analysis and several comparable methods to five breast cancer datasets with an identical set of probes. We simulate incomplete genes by randomly removing a subset of probes from each dataset and demonstrate that our method consistently outperforms two other methods in terms of false discovery rate. We also apply the methods to three gastric cancer datasets for the purpose of discriminating diffuse and intestinal subtypes.

Conclusions: Meta-analysis is an effective approach that identifies more robust sets of differentially expressed genes from multiple studies. The incomplete genes, which mainly arise from the use of different platforms, may also have statistical and biological importance but are ignored or not appropriately handled in previous studies. Our Incomplete Gene Meta-analysis is able to incorporate the incomplete genes by estimating their significance. The results on both breast and gastric cancer datasets suggest that the highly ranked genes and associated GO terms produced by our method are more significant and biologically meaningful according to the previous literature.

Full text (PDF): http://www.biomedcentral.com/content/pdf/1471-2105-12-84.pdf


Fan Shi (0, 2), Gad Abraham (0, 2), Christopher Leckie (0, 2), Izhak Haviv (1), Adam Kowalczyk (2)

0 National ICT Australia, Victoria Research Laboratory, Level 2, Building 193, The University of Melbourne, Victoria 3010, Australia
1 Baker IDI Heart and Diabetes Institute, 250 Kooyong Road, Caulfield, Victoria 3162, Australia
2 Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria 3010, Australia

Background

Gene expression microarrays are a high-throughput technique for measuring the expression levels of thousands of genes simultaneously, and have been widely used in the study of cancer genomics. An important application of gene expression microarrays is detecting differentially expressed genes by statistical analysis. For example, the classical t-test can be used to assess the statistical significance of genes in terms of their ability to discriminate samples from two phenotypes. While many microarray experiments from different laboratories have been performed with the same research aim, these experiments may differ from each other in many respects, e.g., the platform, the probe sets or the characteristics of the samples. Consequently, the significant genes identified by the same statistical analysis in different experiments may be inconsistent. To overcome these inconsistencies, the evidence from multiple studies needs to be combined. Several papers [1-3] directly integrated gene expression data by aligning genes/probes and concatenating samples.
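To make the per-gene t-test mentioned above concrete, the following is a minimal sketch, assuming the classical (pooled-variance) two-sample t-statistic and a small dictionary of made-up expression values. The gene names, the data and the function name `pooled_t` are hypothetical and do not come from the paper; the sketch only ranks genes by the absolute statistic and does not compute p-values.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Classical two-sample t-statistic with a pooled variance estimate."""
    na, nb = len(group_a), len(group_b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / na + 1 / nb))


# Hypothetical toy data: gene -> (values in phenotype 1, values in phenotype 2).
expression = {
    "GENE_A": ([5.1, 5.3, 4.9, 5.2], [7.0, 6.8, 7.2, 7.1]),
    "GENE_B": ([6.0, 6.2, 5.9, 6.1], [6.1, 5.9, 6.0, 6.2]),
}

t_stats = {gene: pooled_t(a, b) for gene, (a, b) in expression.items()}

# Rank genes by absolute t-statistic, most discriminative first.
ranking = sorted(t_stats, key=lambda g: abs(t_stats[g]), reverse=True)
```

Running the same computation on a second dataset with different samples would generally give a different ranking, which is the inconsistency between experiments described above.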
Meta-analysis [4] is another way of generating more robust and consistent statistical results: it integrates multiple datasets and outputs an overall score for each gene/probe across all studies, which we refer to as a meta-score. For example, [5] integrated the p-values from the t-test, [6-8] integrated the effect size based on the model of [4], [9] integrated the ranks of genes, and [10] integrated the test statistics using a normal mixture model that considers the concordance between two datasets. In addition, some papers used meta-analysis techniques to discover significant gene functions. For example, [11] applied meta-analysis directly to the functional categories associated with each individual dataset, rather than to the expression data, in order to identify more significant pathways; [12] used meta-analysis to predict unknown functions of genes. The integration of datasets from different platforms can generate more statistically significant results by reducing biases caused by specific platforms or experimental conditions. The study in [13] first highlighted the importance of the alignment between different platforms as an issue for the meta-analysis of gene expression microarrays. More recentl (...truncated)
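As an illustration of integrating p-values across studies, the sketch below uses Fisher's method, one standard way to combine independent p-values; the text above does not say which combination rule the cited work uses, so this choice is an assumption. Genes missing from a dataset are marked with `None` and simply skipped, which is a naive baseline and not the imputation of missing replicates proposed in this paper. The gene names, p-values and the function name `fisher_combine` are hypothetical.

```python
from math import exp, log


def fisher_combine(p_values):
    """Combine independent p-values (each > 0) with Fisher's method.

    Entries equal to None (gene not measured in that dataset) are skipped.
    Returns the combined p-value, or None if no p-value is available.
    """
    observed = [p for p in p_values if p is not None]
    k = len(observed)
    if k == 0:
        return None
    # Fisher's statistic X = -2 * sum(ln p) is chi-square distributed with
    # 2k degrees of freedom under the null; half_x holds X / 2.
    half_x = -sum(log(p) for p in observed)
    # Chi-square survival function for even degrees of freedom (closed form):
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return exp(-half_x) * total


# Hypothetical per-dataset p-values; None marks a gene absent from a dataset.
p_values = {
    "GENE_A": [0.01, 0.04, 0.03],   # complete gene
    "GENE_B": [0.20, None, 0.50],   # incomplete gene, two datasets
    "GENE_C": [None, 0.02, None],   # incomplete gene, one dataset
}

meta_p = {gene: fisher_combine(ps) for gene, ps in p_values.items()}
```

With a single available p-value the combined value equals that p-value, so under this baseline a gene measured in only one dataset gains nothing from the other studies.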



Fan Shi, Gad Abraham, Christopher Leckie, Izhak Haviv, Adam Kowalczyk. Meta-analysis of gene expression microarrays with missing replicates. BMC Bioinformatics, 2011, 12:84. DOI: 10.1186/1471-2105-12-84