A toolkit for analysing large-scale plant small RNA datasets
Simon Moxon (2), Frank Schwach (2), Tamas Dalmay (1), Dan MacLean (0), David J. Studholme (0), Vincent Moulton (2)
Associate Editor: Ivo Hofacker
(0) The Sainsbury Laboratory, Colney Lane, Norwich, NR4 7UH, UK
(1) School of Biological Sciences, University of East Anglia, Norwich, NR4 7TJ
(2) School of Computing Sciences
Summary: Recent developments in high-throughput sequencing technologies have generated considerable demand for tools to analyse large datasets of small RNA sequences. Here, we describe a suite of web-based tools for processing plant small RNA datasets. Our tools can be used to identify micro RNAs and their targets, compare expression levels in sRNA loci, and find putative trans-acting siRNA loci.
Availability: The tools are freely available for use at http://srnatools.cmp.uea.ac.uk
Contact:
1 INTRODUCTION
Several classes of small (20-30 nt) non-coding RNAs (sRNAs) can
be distinguished by biogenesis and function in post-transcriptional
gene regulation and epigenetic control in plants, animals and
fungi (for reviews see: Brodersen and Voinnet, 2006; Lippman
and Martienssen, 2004). Micro RNAs (miRNAs) and
trans-acting siRNAs (ta-siRNAs) are two important classes of sRNAs
that both induce post-transcriptional silencing of target genes.
Computationally, miRNAs can be identified by their characteristic
fold-back precursors, while ta-siRNAs are found by a phased
alignment pattern at their genomic regions of origin (Axtell et al.,
2006).
Novel high-throughput sequencing technologies greatly facilitate
small RNA detection and analysis (Hafner et al., 2007). However,
the lack of supporting data analysis tools presents a major bottleneck.
Here, we present an easy-to-use web-based toolkit that is specifically
geared towards the analysis of large-scale plant sRNA datasets.
Plant-specific tools are necessary due to important differences in
the biogenesis and mode of action between plant and animal sRNAs
(Millar and Waterhouse, 2005).
To whom correspondence should be addressed.
The authors wish it to be known that, in their opinion, the first two authors
should be regarded as joint First Authors.
2 DESCRIPTION OF THE TOOLS
2.1 miRCat: miRNA detection
miRCat identifies mature miRNAs and their precursors. Users
upload a FASTA file of sRNA sequences, which are mapped to
a plant genome using PatMaN (Prüfer et al., 2008) and grouped
into loci. To enrich for miRNA candidates, the software applies a
number of empirical and published criteria for bona fide miRNA loci
(Jones-Rhoades et al., 2006; details are listed on the tools
website). In brief, the program searches for a two-peak alignment
pattern of sRNAs on one strand of the locus and assesses the
secondary structures of a series of putative precursor transcripts
using the RNAfold (Hofacker et al., 1994) and randfold (Bonnet
et al., 2004) programs. As a result, miRCat produces three files:
(i) a comma-separated text (csv) file with the details for predicted
miRNA candidates, (ii) the RNAfold output for candidate precursors
and (iii) a FASTA file of predicted mature miRNA sequences.
miRCat has been tested on several high-throughput plant sRNA
datasets and shows a high level of sensitivity and specificity.
When tested on a publicly available Arabidopsis leaf sRNA dataset
(GEO accession GSM118373; Rajagopalan et al., 2006) containing
186 899 sRNA sequences, miRCat predicted 89 miRNA loci using
default parameters. Eighty-three of these predictions were known
miRNA sequences and six were novel miRNA loci (Fig. 1a).
The dataset contained 91 known miRNA loci with an sRNA abundance of
five or more (the default threshold for miRCat). This corresponds to
91.2% sensitivity (83 of 91) and, even if all six novel predictions
were false positives, to a specificity of 99.93% (8362 loci
tested). As a web-based tool, miRCat complements related software
developed for local installation and command line use, such as
a recently published program for discovering miRNAs in animal
datasets (Friedländer et al., 2008).
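The sensitivity and specificity figures quoted for the Arabidopsis test can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the negatives are the tested loci that are not among the 91 known miRNA loci, which is one reading of "8362 loci tested" that yields the reported value.

```python
def sensitivity(true_positives, known_positives):
    """Fraction of known miRNA loci that were recovered."""
    return true_positives / known_positives


def worst_case_specificity(loci_tested, known_positives, false_positives):
    """Specificity if every novel prediction is counted as a false positive.

    Assumption: negatives = loci tested minus the known positive loci.
    """
    negatives = loci_tested - known_positives
    return (negatives - false_positives) / negatives


sens = sensitivity(true_positives=83, known_positives=91)
spec = worst_case_specificity(loci_tested=8362, known_positives=91,
                              false_positives=6)

print(f"sensitivity = {sens:.1%}")   # 91.2%
print(f"specificity = {spec:.2%}")   # 99.93%
```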
2.2 SiLoCo: sRNA locus expression comparison
High-throughput sequencing can be used to compare sRNA
expression profiles under varying conditions or between mutants
and wild-type to gain insights into the biogenesis and function of
sRNAs. Plant sRNA populations are highly complex with many
genomic loci producing highly diverse sRNA populations. In such
cases, individual sequences may not be found more than once even
in very large datasets. It is therefore necessary to group sRNAs by
their locus of origin in the genome and to compare expression at the
level of loci rather than individual sequences. Such an approach
also needs to take into account the degree of repetitiveness of
sRNA matches to the genome. SiLoCo identifies sRNA loci on plant
genomes from two sRNA datasets, which can be uploaded by the
user and/or selected from publicly available datasets. SiLoCo maps
sRNA sequences to the genome using PatMaN (Prüfer et al., 2008)
and weights each sRNA hit by its repetitiveness in the genome. Loci
are defined as described previously (Molnár et al., 2007; Mosher
et al., 2008) by a minimum number of sRNA hits to a region
and a maximum gap, i.e. absence of sRNA hits, between them.
Hit counts are normalized to the total number of genome-matching
reads in each sample to make them comparable. For each locus, the
log2 ratio and the average of the normalized sRNA hit counts are
calculated and ranked independently. A sum of the two ranks is also
provided and the results can be downloaded as a csv-formatted file.
Sorting the list of loci by the rank sum in a spreadsheet program
is an easy way of finding the best candidates for differentially
expressed loci, where sRNA abundance differs greatly at a high
overall expression level (Fig. 1b). Hyperlinks to some public genome
browsers can also be included in the result file.
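The locus definition and rank-sum comparison described above can be sketched as follows. This is not the SiLoCo implementation: the function names, the default values of min_hits and max_gap, the per-million scaling, the pseudocount that avoids division by zero, and the choice to rank by absolute log2 ratio are all assumptions made for illustration. Hit weights are taken as given; one plausible weighting, also an assumption, is 1 divided by the number of genomic matches of the read.

```python
import math


def _summarise(cluster):
    """Return (start, end, weighted hit count) for one cluster of hits."""
    return (cluster[0][0], cluster[-1][0], sum(w for _, w in cluster))


def group_loci(hits, min_hits=3, max_gap=300):
    """Group (position, weight) hits on one chromosome into loci.

    A locus ends when the gap to the next hit exceeds max_gap; clusters
    with fewer than min_hits hits are discarded.
    """
    loci, current = [], []
    for pos, weight in sorted(hits):
        if current and pos - current[-1][0] > max_gap:
            if len(current) >= min_hits:
                loci.append(_summarise(current))
            current = []
        current.append((pos, weight))
    if len(current) >= min_hits:
        loci.append(_summarise(current))
    return loci


def compare_loci(counts_a, counts_b, total_a, total_b,
                 pseudocount=1.0, scale=1e6):
    """Rank loci by log2 ratio and by average normalized hit count."""
    rows = []
    for locus in sorted(set(counts_a) | set(counts_b)):
        # Normalize to the total number of genome-matching reads.
        norm_a = counts_a.get(locus, 0.0) / total_a * scale
        norm_b = counts_b.get(locus, 0.0) / total_b * scale
        ratio = math.log2((norm_a + pseudocount) / (norm_b + pseudocount))
        rows.append({"locus": locus, "log2_ratio": ratio,
                     "average": (norm_a + norm_b) / 2})
    by_ratio = sorted(rows, key=lambda r: abs(r["log2_ratio"]), reverse=True)
    by_average = sorted(rows, key=lambda r: r["average"], reverse=True)
    for rank, row in enumerate(by_ratio, start=1):
        row["ratio_rank"] = rank
    for rank, row in enumerate(by_average, start=1):
        row["average_rank"] = rank
    for row in rows:
        row["rank_sum"] = row["ratio_rank"] + row["average_rank"]
    # Smallest rank sum first: large difference at high overall expression.
    return sorted(rows, key=lambda r: r["rank_sum"])


hits = [(100, 1.0), (120, 0.5), (150, 1.0),
        (2000, 1.0), (2100, 1.0), (2150, 0.25), (9000, 1.0)]
loci = group_loci(hits)

ranked = compare_loci(
    {"locus1": 50.0, "locus2": 5.0, "locus3": 400.0},
    {"locus1": 48.0, "locus2": 1.0, "locus3": 40.0},
    total_a=1_000_000, total_b=1_000_000,
)
for row in ranked:
    print(row["locus"], round(row["log2_ratio"], 2), row["rank_sum"])
```

In this toy input, the isolated hit at position 9000 does not form a locus, and locus3 comes first in the ranked list because it combines the largest fold change with the highest average abundance.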
2.3 ta-siRNA prediction
ta-siRNAs are produced from a double-stranded RNA molecule.
Alignments of ta-siRNAs to their region of origin exhibit a
characteristic phased pattern (Axtell et al., 2006) that can be
identified computationally. Our tool is a web-based implementation
of an algorithm proposed by Chen et al. (2007) for calculating
the probability of obtaining the observed percentage (or more) of
phased sRNA matches by chance. An adjustable P-value cutoff is
used to filter for loci with a significant degree of 21 nt phasing.
Results are downloadable as a csv file. A test run with a publicly
available Arabidopsis dataset (Rajagopalan et al., 2006) returned
eight candidate loci, including four known ta-siRNA loci and three
phased loci also reported by Chen et al. (2007).
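As a rough illustration of how the chance probability of phasing might be computed, the sketch below uses a hypergeometric tail: if n distinct sRNA positions are observed in a window, it gives the probability that k or more of them fall on phased positions by chance. The use of the hypergeometric model here, the function name and the window parameters in the example call are assumptions for illustration; the exact formulation and window size should be taken from Chen et al. (2007) rather than from this sketch.

```python
from math import comb


def phasing_p_value(n_srnas, k_phased, window_positions, phased_positions):
    """P(X >= k_phased) under a hypergeometric model.

    n_srnas positions are drawn without replacement from window_positions
    possible positions, of which phased_positions are in phase.
    """
    out_of_phase = window_positions - phased_positions
    total = comb(window_positions, n_srnas)
    upper = min(n_srnas, phased_positions)
    tail = sum(
        comb(phased_positions, x) * comb(out_of_phase, n_srnas - x)
        for x in range(k_phased, upper + 1)
    )
    return tail / total


# Hypothetical example: 10 distinct sRNAs in a window with 461 possible
# positions, 21 of which are in phase; 6 of the 10 sRNAs are phased.
p = phasing_p_value(n_srnas=10, k_phased=6,
                    window_positions=461, phased_positions=21)
print(f"P-value = {p:.3g}")
```

Loci whose P-value falls below the user-chosen cutoff would then be reported as candidates.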
2.4 Helper tools
We provide a web tool to find target transcripts of sRNAs based
on published rules for plant miRNAs (Allen et al., 2005; Schwab
et al., 2005). This tool allows batch searching of up to 50 sRNAs
against 20 different plant gene datasets. In addition, we provide an
interface to the RNAfold/RNAplot programs (Hofacker et al., 1994)
that allows the visualization of mi (...truncated)