Meta-analysis of gene expression microarrays with missing replicates
BMC Bioinformatics
Fan Shi 0 2
Gad Abraham 0 2
Christopher Leckie 0 2
Izhak Haviv 1
Adam Kowalczyk 2
0 National ICT Australia, Victoria Research Laboratory, Level 2, Building 193, The University of Melbourne, Victoria 3010, Australia
1 Baker IDI Heart and Diabetes Institute, 250 Kooyong Road, Caulfield, Victoria 3162, Australia
2 Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria 3010, Australia
Background: Many different microarray experiments are publicly available today. It is natural to ask whether different experiments for the same phenotypic conditions can be combined using meta-analysis, in order to increase the overall sample size. However, some genes are not measured in all experiments; hence, in traditional meta-analysis, they either cannot be included or their statistical significance cannot be appropriately estimated. Nonetheless, these genes, which we refer to as incomplete genes, may also be informative and useful.
Results: We propose a meta-analysis framework, called “Incomplete Gene Meta-analysis”, which can include incomplete genes by imputing the significance of missing replicates and computing a meta-score for every gene across all datasets. In two groups of experiments, we demonstrate that incomplete genes are worth including and that our method can appropriately estimate their significance. We first apply the Incomplete Gene Meta-analysis and several comparable methods to five breast cancer datasets with an identical set of probes. We simulate incomplete genes by randomly removing a subset of probes from each dataset and demonstrate that our method consistently outperforms two other methods in terms of their false discovery rate. We also apply the methods to three gastric cancer datasets for the purpose of discriminating diffuse and intestinal subtypes.
Conclusions: Meta-analysis is an effective approach that identifies more robust sets of differentially expressed genes from multiple studies. The incomplete genes, which mainly arise from the use of different platforms, may also have statistical and biological importance but are ignored or not appropriately handled in previous studies. Our Incomplete Gene Meta-analysis is able to incorporate the incomplete genes by estimating their significance. The results on both breast and gastric cancer datasets suggest that the highly ranked genes and associated GO terms produced by our method are more significant and biologically meaningful according to the previous literature.
Background
Gene expression microarrays are a high-throughput
technique for measuring the expression levels of thousands
of genes simultaneously, and have been widely used in
the study of cancer genomics. An important application
of gene expression microarrays is detecting differentially
expressed genes by statistical analysis. For example, the
classical t-test can be used to assess the statistical
significance of genes in terms of their ability to discriminate
samples from two phenotypes.
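
As a rough illustration of per-gene testing, the sketch below computes a two-sample t-statistic for each gene and ranks genes by its absolute value. It uses the Welch (unequal-variance) form, which is an assumption made here and not necessarily the variant used in the studies discussed. The gene names, expression values, and the helper `welch_t` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's two-sample t-statistic for a single gene.

    Each argument is a list of expression values (at least two per group)
    for the samples of one phenotype.
    """
    n_a, n_b = len(group_a), len(group_b)
    # Standard error of the difference in means, without pooling variances.
    std_err = sqrt(variance(group_a) / n_a + variance(group_b) / n_b)
    return (mean(group_a) - mean(group_b)) / std_err


# Hypothetical toy data: gene -> (values in phenotype A, values in phenotype B).
expression = {
    "gene_1": ([5.1, 5.3, 4.9, 5.2], [7.0, 7.4, 6.8, 7.1]),
    "gene_2": ([3.0, 3.4, 2.9, 3.2], [3.1, 3.3, 3.0, 3.2]),
}

t_scores = {gene: welch_t(a, b) for gene, (a, b) in expression.items()}

# Genes with the largest |t| discriminate the two phenotypes most strongly.
ranked = sorted(t_scores, key=lambda gene: abs(t_scores[gene]), reverse=True)
print(ranked)
```

Converting the statistic into a p-value additionally requires the t distribution, which a statistics library would normally supply.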
While many microarray experiments from different
laboratories have been performed with the same research
aim, the results of these experiments may differ from
each other in many aspects, e.g., the platform, the probe
sets or the characteristics of the samples. Consequently,
the significant genes identified by the same statistical
analysis from different experiments may be inconsistent.
To overcome these inconsistencies, the evidence from
multiple studies needs to be combined. Several papers [1-3]
directly integrated gene expression data by aligning
genes/probes and concatenating samples. Meta-analysis [4]
is another way of generating more robust and consistent
statistical results by integrating multiple datasets and
outputting an overall score, which we refer to as a
meta-score, for each gene/probe across all studies. For example,
[5] integrated the p-values from the t-test, [6-8] integrated
the effect size based on the model of [4], [9] integrated the
ranks of genes, and [10] integrated the test statistics based
on a mixture model of the normal distribution by
considering the concordance between two datasets.
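
To make the idea of a per-gene meta-score concrete, the following sketch combines p-values across datasets with Fisher's method. This is one common combination rule, chosen here as an assumption; the text does not state which rule the cited studies used. A gene missing from a dataset is simply skipped, which is a naive baseline and not the imputation approach proposed in this paper. The gene names, p-values, and the helper `fisher_combine` are hypothetical.

```python
from math import exp, log


def fisher_combine(p_values):
    """Combine independent p-values (each in (0, 1]) with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the upper tail has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total


# Hypothetical per-dataset p-values; None marks a gene not measured there.
p_by_gene = {
    "gene_a": [0.01, 0.02, 0.03],
    "gene_b": [0.04, None, 0.20],
    "gene_c": [None, None, 0.05],
}

meta_scores = {}
for gene, p_list in p_by_gene.items():
    observed = [p for p in p_list if p is not None]
    # Naive handling: combine only the datasets where the gene was measured.
    meta_scores[gene] = fisher_combine(observed)

for gene in sorted(meta_scores, key=meta_scores.get):
    print(gene, meta_scores[gene])
```

Because skipped datasets contribute no evidence, genes measured in fewer datasets are scored on a different number of degrees of freedom, which is part of why incomplete genes are hard to compare fairly with complete ones.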
In addition, some papers used meta-analysis
techniques to discover significant gene functions. For example,
[11] applied meta-analysis directly to the functional
categories associated with each individual dataset, rather
than the expression data, in order to identify more
significant pathways; [12] used meta-analysis to predict
unknown functions of genes.
The integration of datasets from different platforms
can generate more statistically significant results by
reducing biases caused by specific platforms or
experimental conditions. The study in [13] first highlighted
the importance of the alignment between different
platforms as an issue for the meta-analysis of gene
expression microarrays. More recentl (...truncated)