Graphical technique for identifying a monotonic variance stabilizing transformation for absolute gene intensity signals
Kellie J Archer (1, 2)
Catherine I Dumur (0)
Viswanathan Ramakrishnan (2)
0 Department of Pathology, Virginia Commonwealth University, Richmond, VA 23298, USA
1 Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, VA 23298, USA
2 Department of Biostatistics, Virginia Commonwealth University, Richmond, VA 23298, USA
Background: The usefulness of the log2 transformation for cDNA microarray data has led to its widespread application to Affymetrix data. For Affymetrix data, where absolute intensities are indicative of the number of transcripts, there is a systematic relationship between the variance and the magnitude of measurements. Application of the log2 transformation expands the scale of genes with low intensities while compressing the scale of genes with higher intensities, thus reversing the mean-variance relationship. The usefulness of these transformations needs to be examined.
Results: Using an Affymetrix GeneChip dataset, problems associated with applying the log2 transformation to absolute intensity data are demonstrated. Use of the spread-versus-level plot to identify an appropriate variance stabilizing transformation is presented. For the data presented, the spread-versus-level plot identified a power transformation that successfully stabilized the variance of probe set summaries.
Conclusion: The spread-versus-level plot is helpful for identifying transformations for variance stabilization. The approach is robust against outliers and does not require assuming a model or performing a maximization.
Background
Microarrays measure the abundance of thousands of
mRNA transcripts in one experiment. Currently, two
different microarray technologies dominate gene expression
research efforts, namely, custom spotted arrays and
Affymetrix GeneChips. Custom spotted arrays are
characterized by long single strands of complementary DNA
(cDNA) affixed to a solid substrate in spots to which two
different fluorescently labelled samples are hybridized.
Affymetrix GeneChips are characterized by the use of
several (11-20) short oligonucleotide (25-mer) probes to
interrogate for a single gene and to which one
fluorescently labelled sample is hybridized. For both
technologies, the fluorescence intensity for each probe/spot is
assumed to be indicative of the amount of mRNA
transcript in the sample. Since two samples are hybridized to
custom spotted arrays, the ratios of the experimental
signal relative to a control signal of the resulting intensities
are analyzed. On the other hand, for the one sample
hybridized to an Affymetrix GeneChip, the absolute
intensities indicative of the number of transcripts are the
resulting gene expression measures.
To satisfy assumptions required for statistical analyses,
data from both technologies are often transformed by a
suitable function [1]. Specifically, reasons for applying
a transformation to a dataset include achieving stability
in variance, linearity, additivity and/or
normality. Sometimes a transformation is applied
because it facilitates interpretation and induces symmetry.
Such is the case when applying the log2 transformation to
custom spotted microarray (cDNA) data. This
transformation defines the relative abundance of a transcript in an
experimental sample in comparison to the control sample
as the unit of analysis. For genes where the experimental
intensity is greater than the control intensity, the ratio
could take values in the range (1, ∞); for genes where the
experimental intensity is less than the control intensity,
the ratio is compressed to the range (0, 1). The log2
transformation of ratio data promotes symmetry by treating
under and over expressed genes similarly. A typical
example used to elucidate this concept is the one that considers
two genes, one with a ratio of 2.0 representing a doubling
of intensity and another with a ratio of 0.5 representing a
halving of intensity; on a log2 scale these values are
symmetric about zero (no change) with values 1 and -1
respectively [1]. The application of the log2
transformation to cDNA data thus has proven useful for achieving
symmetry as well as ease of interpretation.
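The symmetry argument can be checked numerically. The following is a minimal Python sketch, not part of the original analysis; the function name `log2_ratio` and the intensity values are hypothetical and chosen only to reproduce the ratios 2.0 and 0.5 from the example above.

```python
import math


def log2_ratio(experimental, control):
    """Return log2 of the experimental-to-control intensity ratio."""
    if experimental <= 0 or control <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(experimental / control)


# Hypothetical intensities giving ratios of 2.0 (doubling) and 0.5 (halving).
doubled = log2_ratio(200.0, 100.0)
halved = log2_ratio(50.0, 100.0)

# On the log2 scale the two values are equidistant from zero (no change).
print(doubled, halved)
```

On the ratio scale the two genes lie at unequal distances from 1, whereas on the log2 scale they lie at equal distances from zero, which is the sense in which under and over expression are treated alike.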
The usefulness of the log2 transformation for cDNA
microarray data has led to its widespread application to
Affymetrix GeneChip data as well [2-4]. While this
transformation appeals to biologists for the reasons
stated above for cDNA arrays, it neither facilitates
interpretation nor necessarily satisfies the
assumptions required for statistical analysis, such as equal
variance and normality, in Affymetrix GeneChip data.
When data are amounts or counts, as with Affymetrix
GeneChip data, where the intensities are assumed to
represent amounts of mRNA transcripts, there is often a
systematic relationship between the variance of the
measurements and magnitude of the measurements. The
log2 transformation expands the scale of genes with low
absolute intensities while compressing the scale of genes
with higher intensities; it essentially reverses the direction
of the relationship between the variance and the mean
expression level. That is, after the transformation, genes with
low expression have a higher variance and genes with
high expression have a lower variance [5].
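The reversal can be illustrated with a small simulation. The sketch below is an illustration only and does not use the QAQC data: it assumes count-like intensities whose variance equals their mean, draws them from a normal distribution for simplicity, and uses hypothetical names (`simulate_intensities`) and arbitrary expression levels (100 and 10000).

```python
import math
import random
import statistics


def simulate_intensities(mean, n, rng):
    """Draw n count-like intensities whose variance equals the mean (an assumption)."""
    sd = math.sqrt(mean)
    # Clip at 1.0 so that the logarithm is always defined.
    return [max(rng.gauss(mean, sd), 1.0) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
low = simulate_intensities(100.0, 2000, rng)      # gene with low expression
high = simulate_intensities(10000.0, 2000, rng)   # gene with high expression

raw_var_low = statistics.variance(low)
raw_var_high = statistics.variance(high)
log_var_low = statistics.variance([math.log2(v) for v in low])
log_var_high = statistics.variance([math.log2(v) for v in high])

# Raw scale: variance grows with the mean. Log2 scale: the ordering flips.
print(f"raw variance:  low={raw_var_low:.1f}, high={raw_var_high:.1f}")
print(f"log2 variance: low={log_var_low:.5f}, high={log_var_high:.5f}")
```

Under this assumed mean-variance relationship, the variance on the log2 scale is roughly inversely proportional to the mean, so the gene with low expression ends up with the larger variance. Other mean-variance relationships would give different results.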
Recently, other variance stabilizing transformation
methods for microarray gene expression data have been
introduced [6-8]. The gene expression data have been
modelled as
$$y = \alpha + \mu e^{\eta} + \varepsilon$$
where $y$ represents the measured intensity, $\alpha$ represents
the average background, $\mu$ represents the true gene
expression level, and $\eta$ and $\varepsilon$ are normally distributed error terms
with zero mean and differing non-zero variances $\sigma_\eta^2$ and $\sigma_\varepsilon^2$ [6]. For
this model, when $\mu$ is large, $y$ is distributed approximately
as a lognormal random variable. Therefore, a log
transformation of $y$ results in observations with constant variance
when $\mu$ is large. In this case the generalized logarithm
transformation
$$g(y) = \ln\left(y - \alpha + \sqrt{(y - \alpha)^2 + c}\right),$$
where $c = \sigma_\varepsilon^2 / S_\eta^2$, stabilizes the asymptotic variance [6]
for large samples. Rocke and Durbin [7] compared three
logarithmic-based transformations (the generalized
logarithm, the started logarithm, and the log-linear hybrid)
using a simulation study and in application to an existing
dataset. They found the generalized logarithm resulted in
better overall performance in achieving variance
stabilization.
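For concreteness, the generalized logarithm can be sketched in a few lines of Python. This is a minimal sketch that assumes the transformation has the form ln(y - alpha + sqrt((y - alpha)^2 + c)); the function name `glog` and the parameter values alpha = 20 and c = 400 are hypothetical and are not estimates from any dataset.

```python
import math


def glog(y, alpha, c):
    """Generalized logarithm: ln((y - alpha) + sqrt((y - alpha)^2 + c))."""
    if c <= 0:
        raise ValueError("c must be positive")
    z = y - alpha  # background-corrected intensity
    return math.log(z + math.sqrt(z * z + c))


alpha, c = 20.0, 400.0  # hypothetical background and tuning constant

# Unlike the ordinary logarithm, glog is defined for intensities
# at or below the background level.
below_background = glog(10.0, alpha, c)

# For intensities far above background, glog approaches ln(2 * (y - alpha)),
# that is, an ordinary logarithm shifted by the constant ln(2).
large = glog(1e6 + alpha, alpha, c)
print(below_background, large, math.log(2e6))
```

The sketch shows why the generalized logarithm behaves like the log transformation for highly expressed genes while remaining well defined near the background level.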
In this paper, an Affymetrix GeneChip HG-U133A
dataset consisting of 16 technical replicates (QAQC Dataset),
where the Microarray Suite Software (version 5.0) was
used to derive the expression summaries for all probe sets,
is used to demonstrate some of the problems associated
with applying the log2 transformation to absolute
intensity data. Another approach, which identifies an appropriate
variance stabilizing transformation using the
spread-versus-level plot [9], is proposed. The spread-versus-level plot
plots the log of the median, $\log(M_X)$, on the x-axis
against the log of the fourth-spread on the y-axis. The
slope of the spread-versus-level plot (b) can be used to
suggest a power transformatio (...truncated)