Graphical technique for identifying a monotonic variance stabilizing transformation for absolute gene intensity signals

BMC Bioinformatics, May 2004

Background: The usefulness of log2 transformation for cDNA microarray data has led to its widespread application to Affymetrix data. For Affymetrix data, where absolute intensities are indicative of number of transcripts, there is a systematic relationship between variance and magnitude of measurements. Application of the log2 transformation expands the scale of genes with low intensities while compressing the scale of genes with higher intensities, thus reversing the mean by variance relationship. The usefulness of these transformations needs to be examined.

Results: Using an Affymetrix GeneChip® dataset, problems associated with applying the log2 transformation to absolute intensity data are demonstrated. Use of the spread-versus-level plot to identify an appropriate variance stabilizing transformation is presented. For the data presented, the spread-versus-level plot identified a power transformation that successfully stabilized the variance of probe set summaries.

Conclusion: The spread-versus-level plot is helpful to identify transformations for variance stabilization. This is robust against outliers and avoids assumption of models and maximizations.

Kellie J Archer (1, 2), Catherine I Dumur (0), Viswanathan Ramakrishnan (2)

0 Department of Pathology, Virginia Commonwealth University, Richmond, VA 23298, USA
1 Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, VA 23298, USA
2 Department of Biostatistics, Virginia Commonwealth University, Richmond, VA 23298, USA

Background

Microarrays measure the abundance of thousands of mRNA transcripts in one experiment. Currently, two different microarray technologies dominate gene expression research efforts, namely, custom spotted arrays and Affymetrix GeneChips. Custom spotted arrays are characterized by long single strands of complementary DNA (cDNA) affixed to a solid substrate in spots to which two different fluorescently labelled samples are hybridized.
Affymetrix GeneChips are characterized by the use of several (11-20) short oligonucleotide (25-mer) probes to interrogate a single gene, to which one fluorescently labelled sample is hybridized. For both technologies, the fluorescence intensity for each probe/spot is assumed to be indicative of the amount of mRNA transcript in the sample. Since two samples are hybridized to custom spotted arrays, the ratios of the experimental signal intensity relative to the control signal intensity are analyzed. On the other hand, for the one sample hybridized to an Affymetrix GeneChip, the absolute intensities indicative of the number of transcripts are the resulting gene expression measures.

To satisfy assumptions required for statistical analyses, data from both technologies are often transformed by a suitable function [1]. Specifically, the reasons for applying a transformation to a dataset include achieving stability in variance, linearity, additivity and/or normality. Sometimes a transformation is applied because it facilitates interpretation and induces symmetry. Such is the case when applying the log2 transformation to custom spotted microarray (cDNA) data. This transformation defines the relative abundance of a transcript in an experimental sample in comparison to the control sample as the unit of analysis. For genes where the experimental intensity is greater than the control intensity, the ratio could take values in the range (1, ∞); for genes where the experimental intensity is less than the control intensity, the ratio is compressed to the range (0, 1). The log2 transformation of ratio data promotes symmetry by treating under- and over-expressed genes similarly.
A typical example used to elucidate this concept considers two genes, one with a ratio of 2.0 representing a doubling of intensity and another with a ratio of 0.5 representing a halving of intensity; on a log2 scale these values are symmetric about zero (no change), with values 1 and -1 respectively [1]. The application of the log2 transformation to cDNA data has thus proven useful for achieving symmetry as well as ease of interpretation.

The usefulness of the log2 transformation for cDNA microarray data has led to its widespread application to Affymetrix GeneChip data as well [2-4]. While this transformation is appealing to biologists for the reasons stated above for cDNA arrays, in Affymetrix GeneChip data it neither facilitates interpretation nor necessarily renders true the assumptions required for statistical analysis, such as equal variance and normality. When data are amounts or counts, as with Affymetrix GeneChip data, where the intensities are assumed to represent amounts of mRNA transcripts, there is often a systematic relationship between the variance of the measurements and the magnitude of the measurements. The log2 transformation expands the scale of genes with low absolute intensities while compressing the scale of genes with higher intensities; it essentially reverses the direction of the relationship between the variance and the mean expression level. That is, after the transformation, lowly expressed genes have a higher variance and highly expressed genes have a lower variance [5].

Recently, other variance stabilizing transformation methods for microarray gene expression data have been introduced [6-8]. The gene expression data have been modelled as

$$y = \alpha + \mu e^{\eta} + \varepsilon$$

where $y$ represents the measured intensity, $\alpha$ represents the average background, $\mu$ represents the true gene expression level, and $\eta$ and $\varepsilon$ are normally distributed error terms with zero mean and differing non-zero variances $\sigma_\eta^2$ and $\sigma_\varepsilon^2$ [6].
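A small simulation can make the model above concrete. The sketch below assumes the two-component form y = alpha + mu*exp(eta) + eps with normally distributed eta and eps; every parameter value (alpha, sigma_eta, sigma_eps, the expression levels and the number of replicates) is an illustrative assumption and is not taken from the paper or its dataset. It prints the replicate standard deviation of the raw and the log2-transformed intensities at each level, so the direction of the variance-by-level relationship on each scale can be inspected.

```python
import math
import random
import statistics


def simulate_two_component(mu, n_reps, rng, alpha=60.0,
                           sigma_eta=0.05, sigma_eps=10.0):
    """Draw n_reps intensities y = alpha + mu*exp(eta) + eps for one gene.

    All default parameter values are assumptions chosen for illustration.
    """
    return [
        alpha
        + mu * math.exp(rng.gauss(0.0, sigma_eta))  # multiplicative error
        + rng.gauss(0.0, sigma_eps)                 # additive error
        for _ in range(n_reps)
    ]


rng = random.Random(0)                       # fixed seed for repeatability
mu_levels = [10.0, 100.0, 1000.0, 10000.0]   # hypothetical true levels

sd_raw, sd_log2 = [], []
for mu in mu_levels:
    y = simulate_two_component(mu, 2000, rng)
    sd_raw.append(statistics.stdev(y))
    sd_log2.append(statistics.stdev([math.log2(v) for v in y]))
    print(f"mu={mu:8.0f}  SD(y)={sd_raw[-1]:8.2f}  "
          f"SD(log2 y)={sd_log2[-1]:.3f}")
```

Under these assumed settings the additive error dominates at the lowest level and the multiplicative error dominates at the highest, which is the situation in which the raw spread grows with level while the log2 spread shrinks.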
For this model, when the true expression level $\mu$ is large, $y$ is distributed approximately as a lognormal random variable. Therefore, a log transformation of $y$ results in observations with constant variance when $\mu$ is large. In this case the generalized logarithm transformation

$$g(y) = \ln\left(y - \alpha + \sqrt{(y - \alpha)^2 + c}\right)$$

where $c = \sigma_\varepsilon^2 / S_\eta^2$, stabilizes the asymptotic variance [6] for large samples. Rocke and Durbin [7] compared three logarithmic-based transformations (the generalized logarithm, the started logarithm, and the log-linear hybrid) using a simulation study and in application to an existing dataset. They found that the generalized logarithm resulted in better overall performance in achieving variance stabilization.

In this paper, an Affymetrix GeneChip HG-U133A dataset consisting of 16 technical replicates (QAQC Dataset), where the Microarray Suite Software (version 5.0) was used to derive the expression summaries for all probe sets, is used to demonstrate some of the problems associated with applying the log2 transformation to absolute intensity data. Another approach to identifying an appropriate variance stabilizing transformation, using the spread-versus-level plot [9], is proposed. The spread-versus-level plot plots the log of the median on the x-axis ($\log(M_X)$) against the log of the fourth-spread on the y-axis. The slope of the spread-versus-level plot (b) can be used to suggest a power transformatio (...truncated)
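The spread-versus-level computation can be sketched in a few lines. Because the text is cut off before the rule is stated, the sketch assumes the convention from standard exploratory data analysis treatments, in which a slope b suggests the power p = 1 - b (with p near 0 read as a log transformation). The simulated probe-set rows, the ordinary least-squares line fit and all function names are illustrative assumptions, not the authors' data or code.

```python
import math
import random
import statistics


def fourth_spread(values):
    """Upper hinge minus lower hinge (Tukey's fourths)."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2          # each half includes the median when n is odd
    return statistics.median(xs[n - half:]) - statistics.median(xs[:half])


def spread_level_slope(rows):
    """Least-squares slope b of log(fourth-spread) on log(median), one point per row."""
    x = [math.log(statistics.median(r)) for r in rows]
    y = [math.log(fourth_spread(r)) for r in rows]
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def power_transform(v, p):
    """Apply v**p, treating p close to 0 as the log transformation."""
    return math.log(v) if abs(p) < 1e-8 else v ** p


# Hypothetical data: 300 probe sets x 16 replicates, SD proportional to sqrt(level).
rng = random.Random(1)
rows = []
for _ in range(300):
    level = math.exp(rng.uniform(math.log(100), math.log(10000)))
    rows.append([level + math.sqrt(level) * rng.gauss(0, 1) for _ in range(16)])

b = spread_level_slope(rows)
p = 1 - b                                     # assumed rule: power = 1 - slope
transformed = [[power_transform(v, p) for v in r] for r in rows]
b_after = spread_level_slope(transformed)
print(f"slope b = {b:.2f}, suggested power p = {p:.2f}, "
      f"slope after transform = {b_after:.2f}")
```

A slope close to zero after the transformation indicates that spread no longer depends on level. A resistant line fit could replace the least-squares fit where outlying probe sets are a concern.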


This is a preview of a remote PDF: http://www.biomedcentral.com/content/pdf/1471-2105-5-60.pdf
Article home page: http://www.biomedcentral.com/1471-2105/5/60

Kellie J Archer, Catherine I Dumur, Viswanathan Ramakrishnan. Graphical technique for identifying a monotonic variance stabilizing transformation for absolute gene intensity signals. BMC Bioinformatics, 2004, 5:60. DOI: 10.1186/1471-2105-5-60