FIRMA: a method for detection of alternative splicing from exon array data
E. Purdom (3), K. M. Simpson (1), M. D. Robinson (0, 1), J. G. Conboy (4), A. V. Lapuk (4), T. P. Speed (1, 3)
Associate Editor: David Rocke
0 Department of Medical Biology, University of Melbourne, Parkville, Victoria 3010, Australia
1 The Walter and Eliza Hall Institute, 1G Royal Parade, Parkville, Victoria, 3050
3 Department of Statistics, University of California at Berkeley, 367 Evans Hall #3860, Berkeley, CA 94720-3860, USA
4 Life Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
Motivation: Analyses of EST data show that alternative splicing is much more widespread than once thought. The advent of exon and tiling microarrays means that researchers can now measure alternative splicing experimentally on a genome-wide level. New methods are needed to analyze the data from these arrays.
Results: We present a method, finding isoforms using robust multichip analysis (FIRMA), for detecting differential alternative splicing in exon array data. FIRMA has been developed for Affymetrix exon arrays, but could in principle be extended to other exon arrays, tiling arrays or splice junction arrays. We have evaluated the method using simulated data, and have also applied it to two datasets: a panel of 11 human tissues and a set of 10 pairs of matched normal and tumor colon tissue. FIRMA is able to detect exons in several genes confirmed by reverse transcriptase PCR.
Availability: R code implementing our methods is contributed to the package aroma.affymetrix.
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Alternative splicing is thought to have several roles in complex
organisms, primarily in increasing protein diversity (Maniatis and
Tasic, 2002). It can affect the intracellular localization, binding
properties or stability of a protein, or regulate its expression via
nonsense-mediated decay (NMD) (Stamm et al., 2005). These events
usually occur in a regulated manner, but if an aberrant splicing event
occurs, it can be causative for, or symptomatic of, disease. More than
15% of heritable human diseases are known to be associated with
mutations in splice sites or in splicing regulatory elements (Matlin
et al., 2005). In particular, aberrant pre-mRNA splicing events are
known to be implicated in several types of cancer (Brinkman, 2004;
Venables, 2004).
Previously thought to be a relatively uncommon phenomenon,
alternative splicing has recently been shown to be widespread
throughout the genome. Analyses of data on human expressed
sequence tags (ESTs) give estimated lower bounds between
35% and 59% for the proportion of genes which have at least
one splice variant (Modrek and Lee, 2002). The frequency of
functional alternative splicing events is probably lower than
this. Several groups have searched for alternative splicing events
conserved between human and mouse, and their results suggest
that the proportion of functionally alternatively spliced genes is
10% (Sorek et al., 2004; Sugnet et al., 2004; Yeo et al., 2005).
A weakness of all EST-based methods is that they are biased towards
genes which have greater EST coverage (Modrek and Lee, 2002).
Several kinds of alternative splicing have been observed (see
Black, 2003, for a recent review). The most common form is
skipping or inclusion of one or more cassette exons (roughly
40-50% of cases based on bioinformatic evidence; Clark and
Thanaraj, 2002; Sugnet et al., 2004), these being exons which are
wholly present in some transcripts, and wholly absent in some others.
Alternatively, mutually exclusive cassette exon usage can take place;
e.g. exon A or exon B forms part of the transcript, but never A
and B together (more generally, multiple exons can exhibit mutual
exclusivity). Usage of alternative 3′ or 5′ splice sites can result in
shortening or lengthening of an exon. Other types of alternative
splicing that have been observed are alternative promoter usage,
alternative polyadenylation sites and intron retention. Additionally,
any combination of the above may occur in an alternatively spliced
transcript (Black, 2003).
Skipping or inclusion of internal cassette exons is the most
common kind of alternative splicing, and possibly the easiest to
detect and verify. For this reason, we have focused on identifying
specific exons showing patterns of differential alternative expression
and have not approached the problem of reconstructing more
complicated transcript patterns.
Our algorithm FIRMA has been developed for analyzing the
Affymetrix exon array (Affymetrix, Santa Clara, California, USA), which queries
the expression level of well-annotated as well as predicted
exons. In brief, FIRMA scores each exon as to whether its probes
systematically deviate from the expected gene expression level.
With a small number of probes per exon (four or fewer), this is a
challenging microarray platform to analyze: such deviations can
come from a myriad of biological and technical factors unrelated
to alternative splicing. We show that FIRMA performs well in
detecting exon-specific changes in expression and therefore can
contribute substantially to the detection of regulated alternative
splicing. Of course a single scoring method can only be one step
in the analysis, and any results must be evaluated in the light of
these other complications.
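
The scoring idea described in brief above can be illustrated with a small sketch. It is not the authors' implementation and not the aroma.affymetrix interface. It assumes an additive probe-plus-sample model for the log2 PM intensities of one gene, fitted here by median polish, and it scores each exon in each sample by the median residual of that exon's probes divided by a MAD-based scale. The function names, the toy data and the fallback scale are all hypothetical choices made for illustration.

```python
from statistics import median


def median_polish_residuals(x, n_iter=3):
    """Residuals after sweeping out row (probe) and column (sample) medians."""
    r = [row[:] for row in x]
    for _ in range(n_iter):
        for row in r:                       # remove each probe's median
            m = median(row)
            for j in range(len(row)):
                row[j] -= m
        for j in range(len(r[0])):          # remove each sample's median
            m = median(row[j] for row in r)
            for row in r:
                row[j] -= m
    return r


def exon_scores(log2_pm, probe_exon):
    """Per-exon, per-sample score: median probe residual over a robust scale."""
    resid = median_polish_residuals(log2_pm)
    flat = [v for row in resid for v in row]
    centre = median(flat)
    mad = median(abs(v - centre) for v in flat)
    # Assumption: fall back to a unit scale if the residuals are degenerate.
    scale = 1.4826 * mad if mad > 0 else 1.0
    n_samples = len(log2_pm[0])
    scores = {}
    for exon in sorted(set(probe_exon)):
        rows = [resid[p] for p, e in enumerate(probe_exon) if e == exon]
        scores[exon] = [median(row[j] for row in rows) / scale
                        for j in range(n_samples)]
    return scores


# Toy gene (invented data): 3 exons x 4 probes, 4 samples.
probe_exon = ["e1"] * 4 + ["e2"] * 4 + ["e3"] * 4
probe_effect = [8.0 + 0.25 * p for p in range(12)]
sample_effect = [0.0, 0.5, 1.0, 1.5]
log2_pm = [
    [probe_effect[p] + sample_effect[j] + ((7 * p + 3 * j) % 5 - 2) * 0.01
     for j in range(4)]
    for p in range(12)
]
for p in range(4, 8):
    log2_pm[p][3] -= 3.0    # simulate exon e2 being skipped in the last sample

scores = exon_scores(log2_pm, probe_exon)
lowest = min((s, exon, j) for exon, row in scores.items()
             for j, s in enumerate(row))
print("most negative score:", lowest[1], "in sample", lowest[2])
```

In this toy example the exon whose probes were lowered in a single sample receives the most negative score there, while the rest of the gene stays near zero. Real data carry the additional biological and technical complications noted above, so a score of this kind is only a starting point.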
The GeneChip Human Exon 1.0 ST (sense target) array is a whole-genome
array, containing over 1.4 million probesets of up to four perfect match (PM)
probes each, spread across exons from all known genes, plus a number of
additional regions based on other annotation sources, including GENSCAN
predictions and ESTs from dbEST. In the design phase, sequences from all
the annotation sources were mapped to the July 2003 version of the human
genome (UCSC hg16, NCBI 34). Regions which had some evidence from
one or more sources for being transcribed were divided into probe selection
regions (PSR) according to the presence of canonical splice sites, CDS start
and stop positions or polyadenylation sites. Probes were then selected from
within PSRs >25 bp in length. Each PSR corresponds to a probeset, which
generally contains four possibly overlapping probes (sometimes fewer).
About a quarter of the probesets are based solely on EST evidence, while
another quarter are based solely on GENSCAN predictions (GeneChip
Exon Array Design Technical Note, Affymetrix).
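
The division of a transcribed region into PSRs can be pictured with a toy sketch. The coordinates below are invented, the boundary list stands in for any of the features named above (splice sites, CDS start and stop positions, polyadenylation sites), and half-open intervals are an assumption of this illustration rather than a description of the Affymetrix design pipeline.

```python
MIN_PSR_LENGTH = 25  # probes were selected only from PSRs longer than this (bp)


def split_into_psrs(start, end, boundaries):
    """Split the half-open region [start, end) at each internal boundary."""
    cuts = sorted({b for b in boundaries if start < b < end})
    edges = [start] + cuts + [end]
    return list(zip(edges[:-1], edges[1:]))


def probe_eligible(psrs, min_length=MIN_PSR_LENGTH):
    """Keep only the PSRs that are long enough for probe selection."""
    return [(s, e) for s, e in psrs if e - s > min_length]


# Hypothetical transcribed region with three internal boundaries.
psrs = split_into_psrs(1000, 1200, [1020, 1100, 1180])
eligible = probe_eligible(psrs)
print(psrs)
print(eligible)
```

Here the two 20 bp end segments are too short to carry probes, so only the two 80 bp segments would correspond to probesets.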
The array contains only PM probes, with a small number of generic
mismatch probes for the purposes of background correction. There are no
probes which span exon-exon junctions.
Association of probesets with genes is not made at design time. Instead,
these main-design probesets are annotated afterwards, using their alignment
to the genome (Exon Probeset Annotations Whitepaper, Affymetrix). This
process has been undertaken by Affymetrix, first for NCBI Build 34 of the
genome, and more recently for Build 35. The result is (...truncated)