Efforts Aimed at Reducing Noise, Data Overload in Microarrays

https://academic.oup.com/jnci/article-pdf/97/16/1173/7686515/dji268.pdf

News, Journal of the National Cancer Institute, Vol. 97, No. 16, August 17, 2005, p. 1173

Microarrays have helped researchers identify previously unrecognized subtypes of cancers, and more recently they have been tested for their ability to identify cancers with better or worse prognosis (see News, Vol. 97, No. 5, p. 331, “Trial and Error: Prognostic Gene Signature Study Design Altered”). Now, researchers are working to find the best way to take the tool to a new level of complexity by asking it to help them identify genes involved in the basic biology of tumors.

Experts in the field expect that the approach will work but caution that it won’t be entirely straightforward.

“For me, prediction is something we can often do without understanding the underlying biology, and that is much more difficult,” said Jill Mesirov, Ph.D., director of computational biology and bioinformatics at the Broad Institute at the Massachusetts Institute of Technology and Harvard in Cambridge, Mass.

[Photo: Jill Mesirov]

The problem boils down to issues of noise in the data and the ability to demonstrate biological relevance. For Mesirov, the use of gene sets, which are sometimes referred to as metagenes, can help address both problems. If, instead of analyzing the data in terms of individual genes, an investigator looks for gene sets that are enriched in a given tumor type, the data are likely to be more reproducible because the signal-to-noise ratio improves when 400 gene sets are analyzed versus 10,000 genes.
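The signal-to-noise argument can be illustrated with a toy simulation. The sketch below is not the gene set enrichment analysis algorithm itself; it simply averages the genes within one set, and every number in it (100 genes, 20 samples per group, a shift of 0.3 standard deviations, statistically independent genes) is an assumption chosen for illustration rather than a value taken from any study. Under those assumptions, a shift that is too small to stand out gene by gene becomes obvious once the set is scored as a whole.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples (a minus b)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def simulate(n_genes=100, n_per_group=20, shift=-0.3, seed=0):
    """Simulate one gene set; return (per-gene t statistics, set-level t statistic).

    All parameters are hypothetical. Expression values are independent
    unit-variance Gaussian noise, and every gene in the set is shifted by
    `shift` standard deviations in the case group.
    """
    rng = random.Random(seed)
    control = [[rng.gauss(0.0, 1.0) for _ in range(n_per_group)] for _ in range(n_genes)]
    case = [[rng.gauss(shift, 1.0) for _ in range(n_per_group)] for _ in range(n_genes)]

    # Gene-by-gene comparison: each test sees a small shift against full noise.
    gene_t = [welch_t(case[g], control[g]) for g in range(n_genes)]

    # Set-level comparison: score each sample by its mean over the set, so
    # independent noise averages out while the shared shift remains.
    def set_scores(group):
        return [statistics.mean(gene[s] for gene in group) for s in range(n_per_group)]

    set_t = welch_t(set_scores(case), set_scores(control))
    return gene_t, set_t


if __name__ == "__main__":
    gene_t, set_t = simulate()
    print("median |t| per gene:", round(statistics.median(abs(t) for t in gene_t), 2))
    print("set-level t:", round(set_t, 2))
```

With independent noise, averaging n genes shrinks the noise in the set score by roughly the square root of n, which is why the set-level statistic comes out so much larger than the typical per-gene statistic. Genes that are truly coordinately expressed are correlated with one another, so the gain in a real experiment would be smaller than in this sketch.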
Thus, genes that wouldn’t show up very well individually may do so if they are coordinately expressed and biologically important.

When she speaks to biologists, Mesirov points out that the biggest problem in many array experiments is that scientists end up with either too many differentially expressed genes or none. If they have too many, they can cherry-pick the genes on the list that look most interesting to them based on prior knowledge, but those aren’t necessarily the most important, and therefore the approach can be misleading.

The quintessential example of gene set analysis comes from a diabetes study led by the Broad Institute’s Vamsi Mootha, in which Mesirov’s group participated several years ago. They performed microarray analysis on muscle biopsy samples from patients with diabetes and from control subjects who had normal glucose tolerance. At the individual gene level, there were no statistically significant differences in the expression data. When they used gene set enrichment analysis, they found a statistically significant decrease in the expression of genes in the oxidative phosphorylation pathway. Individually, the expression level of each gene decreased between the control and diabetic samples by only 15%–20%, but because there were approximately 100 genes in the set, the difference became statistically significant.

The other advantage of gene sets, said Mesirov, is that they often come with substantial biological information, which provides a head start in a functional analysis. Of course, the output data are only as good as the data used to derive the gene set, cautioned Mesirov, which means that evaluating the strength of those data before intertwining them with the current experiment pays off. (Her team bundles several already annotated gene sets in the software (...truncated)

one is separating the wheat from the chaff very well.” To get around the problem in his own laboratory, he now relies on constraints-based analyses, in which he first separates tumor samples into known breast cancer subtypes, including Her2 status, estrogen receptor status, or BRCA1 or -2 status, and triple-negative disease, which lacks all three markers. Working from that starting place, Slamon can discern pathway or gene expression differences that arise in one tumor type versus another, which may mean that the gene or pathway is involved in tumorigenesis rather than in the final tumor phenotype. In other words, a gene set upregulated in all of the tumor types may be a signature for a late-stage disease phenotype, such as aggressiveness or invasiveness, but it is unlikely to be causal in the early stages of the disease, as that gene expression pattern occurs in tumors that have different underlying genetic problems.

[Photo: Dennis Slamon]

Using this strategy, his team found that the vascular endothelial growth factor (VEGF) is dramatically upregulated in Her2-positive cancers. VEGF is also upregulated in some of the tumors from other breast cancer classes, but the consistency of the upregulation in Her2 tumors led his group to think it wasn’t just a bystander, but part of the underlying problem in this pathology. “It’s interesting that you can make the intellectual link between Her2 and VEGF, but you still need to go back and do the biology,” said Slamon. To do this, his team looked to see if the Her2–VEGF correlation held up in a variety of samples. They also found that treating cells with trastuzumab (Herceptin), an antibody against Her2/neu protein, caused a drop in VEGF expression and that patients with higher VEGF expression tended to have more aggressive disease.

From these and other preclinical data, which suggested a causative role for VEGF in the Her2 breast cancer phenotype, the team tested a combination of trastuzumab and a recombinant monoclonal antibody against VEGF in a phase I trial with nine patients with Her2-positive cancer. Two patients had a complete response, three had partial responses, and there were no unexpected toxicities, according to data Slamon presented earlier this year at the annual meeting of the American Association for Cancer Research. The team has now launched a 50-patient phase II trial.

Experts agree that, to obtain that kind of success, researchers must use a reasonable number of samples. Just what that number is, though, is unclear, especially at the outset of an experiment, because the “right” number will be determined in part by the expression level of the genes under study.

David Bowtell, Ph.D., director of research and professor at the Peter MacCallum Cancer Institute in Melbourne, Australia, and his group recently published a study that used microarrays to categorize tumors of unknown primary origin. During that study, they looked at the number of samples required to derive a reproducible signature that could define the tissue of origin of a tumor. Their data show that although 10 samples were enough to adequately represent a relatively homogeneous tumor type such as colon cancer, they needed substantially more samples from histologically variable cancers, such as ovarian and lung, to obtain a reproducible signature.

[Photo: David Bowtell]

To gain enough ovarian tumor samples and to have clean, complete clinical data that go along with them, Bowtell is leading the Australian Ovarian Cancer Study, which aims to collect