A toolkit for analysing large-scale plant small RNA datasets

https://bioinformatics.oxfordjournals.org/content/24/19/2252.full.pdf

Simon Moxon (2), Frank Schwach (2), Tamas Dalmay (1), Dan MacLean (0), David J. Studholme (0), Vincent Moulton (2)

Associate Editor: Ivo Hofacker

(0) The Sainsbury Laboratory, Colney Lane, Norwich, NR4 7UH, UK
(1) School of Biological Sciences, University of East Anglia, Norwich, NR4 7TJ
(2) School of Computing Sciences

Summary: Recent developments in high-throughput sequencing technologies have generated considerable demand for tools to analyse large datasets of small RNA sequences. Here, we describe a suite of web-based tools for processing plant small RNA datasets. Our tools can be used to identify micro RNAs and their targets, compare expression levels in sRNA loci, and find putative transacting siRNA loci.

Availability: The tools are freely available for use at http://srnatools.cmp.uea.ac.uk

Contact: -

1 INTRODUCTION

Several classes of small (20-30 nt) non-coding RNAs (sRNAs) can be distinguished by biogenesis and function in post-transcriptional gene regulation and epigenetic control in plants, animals and fungi (for reviews see: Brodersen and Voinnet, 2006; Lippman and Martienssen, 2004). Micro RNAs (miRNAs) and transacting siRNAs (ta-siRNAs) are two important classes of sRNAs that both induce post-transcriptional silencing of target genes. Computationally, miRNAs can be identified by their characteristic fold-back precursors, while ta-siRNAs are found by a phased alignment pattern at their genomic regions of origin (Axtell et al., 2006).

Novel high-throughput sequencing technologies greatly facilitate small RNA detection and analysis (Hafner et al., 2007). However, the lack of supporting data analysis tools presents a major bottleneck. Here, we present an easy-to-use web-based toolkit that is specifically geared towards the analysis of large-scale plant sRNA datasets. Plant-specific tools are necessary due to important differences in the biogenesis and mode of action between plant and animal sRNAs (Millar and Waterhouse, 2005).
Footnotes: To whom correspondence should be addressed. The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

2 DESCRIPTION OF THE TOOLS

2.1 miRCat: miRNA detection

miRCat identifies mature miRNAs and their precursors. Users upload a FASTA file of sRNA sequences, which are mapped to a plant genome using PatMaN (Prüfer et al., 2008) and grouped into loci. To enrich for miRNA candidates, the software applies a number of empirical and published criteria for bona fide miRNA loci (Jones-Rhoades et al., 2006; details are listed on the tools website). In brief, the program searches for a two-peak alignment pattern of sRNAs on one strand of the locus and assesses the secondary structures of a series of putative precursor transcripts using the RNAfold (Hofacker et al., 1994) and randfold (Bonnet et al., 2004) programs. miRCat produces three output files: (i) a comma-separated text (csv) file with the details of the predicted miRNA candidates, (ii) the RNAfold output for candidate precursors and (iii) a FASTA file of predicted mature miRNA sequences.

miRCat has been tested on several high-throughput plant sRNA datasets and shows a high level of sensitivity and specificity. When tested on a publicly available Arabidopsis leaf sRNA dataset (GEO accession GSM118373; Rajagopalan et al., 2006) containing 186 899 sRNA sequences, miRCat predicted 89 miRNA loci using default parameters. Eighty-three of these predictions were known miRNA sequences and six were novel miRNA loci (Fig. 1a). The dataset contained 91 known miRNA loci with an sRNA abundance of five or more (the default threshold for miRCat). This corresponds to a sensitivity of 91.2% (83 of 91) and, even if all six novel predictions were false positives, to a specificity of 99.93% (8362 loci tested).
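The sensitivity and specificity figures quoted above can be reproduced with a short calculation. The sketch below is illustrative only and is not part of miRCat; it assumes that the 8362 loci tested include the 91 detectable known miRNA loci, and it takes the worst case in which all six novel predictions are false positives. The variable names are chosen for this example.

```python
# Worked check of the miRCat benchmark figures quoted in the text.
# Assumption (not stated explicitly in the source): the 8362 loci tested
# include the 91 known miRNA loci that pass the abundance threshold.

def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real positives that were recovered."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of real negatives that were correctly rejected."""
    return true_neg / (true_neg + false_pos)

loci_tested = 8362        # all loci evaluated
known_detectable = 91     # known miRNA loci with abundance >= 5
known_predicted = 83      # predictions matching known miRNAs
novel_predicted = 6       # predictions not matching known miRNAs

false_neg = known_detectable - known_predicted   # known loci that were missed
negatives = loci_tested - known_detectable       # loci that are not known miRNAs
false_pos = novel_predicted                      # worst case: all novel calls are wrong
true_neg = negatives - false_pos

sens = sensitivity(known_predicted, false_neg)
spec = specificity(true_neg, false_pos)

print(f"sensitivity = {sens:.1%}")
print(f"specificity = {spec:.2%}")
```

Because the number of non-miRNA loci is so much larger than the number of predictions, the specificity is insensitive to the exact accounting of the six novel loci.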
As a web-based tool, miRCat complements related software developed for local installation and command line use, such as a recently published program for discovering miRNAs in animal datasets (Friedländer et al., 2008).

2.2 SiLoCo: sRNA locus expression comparison

High-throughput sequencing can be used to compare sRNA expression profiles under varying conditions, or between mutants and wild-type, to gain insights into the biogenesis and function of sRNAs. Plant sRNA populations are highly complex, with many genomic loci producing highly diverse sRNA populations. In such cases, individual sequences may not be found more than once even in very large datasets. It is therefore necessary to group sRNAs by their locus of origin in the genome and to compare expression at the locus level rather than at the level of individual sequences. Such an approach also needs to take into account the degree of repetitiveness of sRNA matches to the genome.

SiLoCo identifies sRNA loci on plant genomes from two sRNA datasets, which can be uploaded by the user and/or selected from publicly available datasets. SiLoCo maps sRNA sequences to the genome using PatMaN (Prüfer et al., 2008) and weights each sRNA hit by its repetitiveness in the genome. Loci are defined as described previously (Molnár et al., 2007; Mosher et al., 2008) by a minimum number of sRNA hits to a region and a maximum gap, i.e. absence of sRNA hits, between them. Hit counts are normalized to the total number of genome-matching reads in each sample to make them comparable. For each locus, the log2 ratio and the average of the normalized sRNA hit counts are calculated and ranked independently. A sum of the two ranks is also provided, and the results can be downloaded as a csv-formatted file. Sorting the list of loci by the rank sum in a spreadsheet program is an easy way of finding the best candidates for differentially expressed loci, i.e. those where sRNA abundance differs greatly at a high overall expression level (Fig. 1b).
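The locus comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SiLoCo implementation: the `Hit` record, the 1/(number of genome matches) weighting, the pooling of both samples to define loci, the per-million scaling, the pseudocount and the default thresholds are all hypothetical choices made for this example.

```python
import math
from collections import namedtuple

# Hypothetical record for one mapped sRNA hit; genome_matches is the number
# of positions in the genome that the same sequence matches.
Hit = namedtuple("Hit", "chrom start end genome_matches")

def weighted_count(hits):
    """Assumed weighting: a hit matching n genome positions counts as 1/n."""
    return sum(1.0 / h.genome_matches for h in hits)

def define_loci(hits, max_gap=300, min_hits=3):
    """Group hits into loci: consecutive hits on the same chromosome are
    merged while the gap between them is at most max_gap; groups with fewer
    than min_hits hits are discarded. Both thresholds are illustrative."""
    loci, current, cur_end = [], [], None
    for h in sorted(hits, key=lambda x: (x.chrom, x.start)):
        same_locus = (current and h.chrom == current[0].chrom
                      and h.start - cur_end <= max_gap)
        if not same_locus:
            if len(current) >= min_hits:
                loci.append((current[0].chrom, current[0].start, cur_end))
            current, cur_end = [], h.end
        current.append(h)
        cur_end = max(cur_end, h.end)
    if len(current) >= min_hits:
        loci.append((current[0].chrom, current[0].start, cur_end))
    return loci

def locus_count(locus, hits, total_reads):
    """Weighted hit count inside a locus, scaled per million genome-matching reads."""
    chrom, start, end = locus
    inside = [h for h in hits
              if h.chrom == chrom and h.start >= start and h.end <= end]
    return weighted_count(inside) / total_reads * 1e6

def compare(loci, hits_a, total_a, hits_b, total_b, pseudo=1.0):
    """Compute log2 ratio and mean per locus, rank each independently
    (rank 1 = largest absolute value), and sort by the sum of the two ranks."""
    rows = []
    for locus in loci:
        a = locus_count(locus, hits_a, total_a)
        b = locus_count(locus, hits_b, total_b)
        rows.append({"locus": locus,
                     "log2_ratio": math.log2((a + pseudo) / (b + pseudo)),
                     "mean": (a + b) / 2})
    for key, rank_key in (("log2_ratio", "rank_ratio"), ("mean", "rank_mean")):
        ordered = sorted(rows, key=lambda r: abs(r[key]), reverse=True)
        for rank, row in enumerate(ordered, start=1):
            row[rank_key] = rank
    for row in rows:
        row["rank_sum"] = row["rank_ratio"] + row["rank_mean"]
    return sorted(rows, key=lambda r: r["rank_sum"])

# Toy data: the first locus is abundant in sample A only, the second is
# equally abundant in both samples.
hits_a = ([Hit("chr1", 100 + 5 * i, 120 + 5 * i, 1) for i in range(10)]
          + [Hit("chr1", 5000 + 5 * i, 5020 + 5 * i, 1) for i in range(3)])
hits_b = ([Hit("chr1", 100, 120, 1)]
          + [Hit("chr1", 5000 + 5 * i, 5020 + 5 * i, 1) for i in range(3)])

loci = define_loci(hits_a + hits_b)
results = compare(loci, hits_a, 1_000_000, hits_b, 1_000_000)
for row in results:
    print(row["locus"], round(row["log2_ratio"], 2), row["mean"], row["rank_sum"])
```

Sorting by the rank sum places loci with both a large fold change and a high overall abundance at the top of the list, which is the property the text relies on when it recommends sorting the csv output in a spreadsheet.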
Hyperlinks to some public genome browsers can also be included in the result file.

2.3 ta-siRNA prediction

ta-siRNAs are produced from a double-stranded RNA molecule. Alignments of ta-siRNAs to their region of origin exhibit a characteristic phased pattern (Axtell et al., 2006) that can be identified computationally. Our tool is a web-based implementation of an algorithm proposed by Chen et al. (2007) for calculating the probability of obtaining the observed percentage (or more) of phased sRNA matches by chance. An adjustable P-value cutoff is used to filter for loci with a significant degree of 21 nt phasing. Results are downloadable as a csv file. A test run with a publicly available Arabidopsis dataset (Rajagopalan et al., 2006) returned eight candidate loci, including four known ta-siRNA loci and three phased loci also reported by Chen et al. (2007).

2.4 Helper tools

We provide a web tool to find target transcripts of sRNAs based on published rules for plant miRNAs (Allen et al., 2005; Schwab et al., 2005). This tool allows batch searching of up to 50 sRNAs against 20 different plant gene datasets. In addition, we provide an interface to the RNAfold/RNAplot programs (Hofacker et al., 1994) that allows the visualization of mi (...truncated)