PLncDB: plant long non-coding RNA database

Bioinformatics, Apr 2013

Summary: Plant long non-coding RNA database (PLncDB) attempts to provide the following functions related to long non-coding RNAs (lncRNAs): (i) genomic information for a large number of lncRNAs collected from various resources; (ii) an online genome browser for plant lncRNAs based on a platform similar to that of the UCSC Genome Browser; (iii) integration of transcriptome datasets derived from various samples including different tissues, developmental stages, mutants and stress treatments; and (iv) a list of epigenetic modification datasets and small RNA datasets. Currently, PLncDB provides a comprehensive genomic view of Arabidopsis lncRNAs for the plant research community. The database will be regularly updated with new plant genomes as they become available, so as to facilitate future investigations on plant lncRNAs.

Availability: PLncDB is freely accessible at http://chualab.rockefeller.edu/gbrowse2/homepage.html and all results can be downloaded for free at the website.

Contact: chua{at}rockefeller.edu


Jingjing Jin (0, 1), Jun Liu (1), Huan Wang (1), Limsoon Wong (0), Nam-Hai Chua (1)

(0) School of Computing, National University of Singapore, 117417 Singapore
(1) Laboratory of Plant Molecular Biology, Rockefeller University, New York, NY 10065, USA

Associate Editor: Ivo Hofacker

Received on January 7, 2013; revised on January 18, 2013; accepted on February 22, 2013

The Author 2013. Published by Oxford University Press. All rights reserved.

1 INTRODUCTION

Non-coding RNAs (ncRNAs) are a family of RNAs that do not encode proteins. On the basis of their length and genomic location, ncRNAs can be further classified as (i) small ncRNAs, including miRNAs and small interfering RNAs (siRNAs); (ii) natural antisense transcripts (NATs) (Wang et al., 2005, 2006; Zhang et al., 2012); (iii) long intronic non-coding RNAs (incRNAs); and (iv) long intergenic non-coding RNAs (lincRNAs) (Guttman et al., 2009; Liu et al., 2012).
RNAs in the last three categories are 200 nt or longer and are referred to as long non-coding RNAs (lncRNAs). Genomes of human (Gupta et al., 2010; Khalil et al., 2009), mouse (Dinger et al., 2008) and fly (Tupy et al., 2005) have been shown to encode lncRNAs that play important roles in cell differentiation, immune response, imprinting, tumorigenesis and other important biological processes (Dinger et al., 2008; Gupta et al., 2010; Khalil et al., 2009; Liao et al., 2011a; Wilusz et al., 2009). In addition, genetic mutations of human lncRNAs have been shown to be associated with diseases and pathophysiological conditions (Cabianca et al., 2012; Gupta et al., 2010; Hu et al., 2011).

For plants, genome-wide searches for ncRNAs have previously been conducted in Arabidopsis thaliana (MacIntosh et al., 2001; Marker et al., 2002; Rymarquis et al., 2008; Song et al., 2009), Medicago truncatula (Wen et al., 2007), Zea mays (Boerner and McGinnis, 2012) and Triticum aestivum (Xin et al., 2011). A recent genome-wide study based on around 200 Arabidopsis tiling array datasets and RNA sequencing (RNA-seq) identified thousands of lncRNAs in Arabidopsis (Liu et al., 2012). These lncRNAs show tissue-specific expression, and a large number of them are responsive to abiotic stresses (Liu et al., 2012). However, the function of these lncRNAs remains largely unexplored.

Genomic loci of many lncRNAs are associated with histone modifications and DNA methylation, suggesting epigenetic regulation of these loci (Guttman et al., 2009; Liu et al., 2012). In addition, biogenesis of a subgroup of lncRNAs is co-regulated by CBP20, CBP80 and SERRATE (Liu et al., 2012). Some double-stranded RNAs formed by sense and antisense transcripts involving lncRNA partners are processed by the RNA interference machinery into siRNAs (Zhang et al., 2012).
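As a minimal sketch of the length- and location-based classification outlined at the start of this section, the Python snippet below assigns a transcript to one of the four ncRNA classes. Only the 200 nt threshold comes from the text. The `Context` labels and the `classify_ncrna` function are hypothetical names introduced for illustration, and the genomic context is assumed to have been determined beforehand by comparison with a gene annotation.

```python
from enum import Enum

LNCRNA_MIN_LENGTH = 200  # nt; lncRNAs are described in the text as 200 nt or longer


class Context(Enum):
    """Hypothetical labels for a transcript's position relative to annotated genes."""
    ANTISENSE = "antisense"    # overlaps a gene on the opposite strand
    INTRONIC = "intronic"      # lies within an intron
    INTERGENIC = "intergenic"  # lies between annotated genes


# Mapping from assumed genomic context to the lncRNA class names used in the text.
LONG_CLASSES = {
    Context.ANTISENSE: "NAT",
    Context.INTRONIC: "incRNA",
    Context.INTERGENIC: "lincRNA",
}


def classify_ncrna(length_nt: int, context: Context) -> str:
    """Return the ncRNA class for a non-coding transcript (illustrative only)."""
    if length_nt < LNCRNA_MIN_LENGTH:
        return "small ncRNA"
    return LONG_CLASSES[context]


print(classify_ncrna(21, Context.INTERGENIC))
print(classify_ncrna(850, Context.INTERGENIC))
```

The sketch treats every transcript shorter than 200 nt as a small ncRNA regardless of context, which is a simplification made here rather than a rule stated by the authors.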
Although thousands of lncRNAs have been identified in Arabidopsis and other plants and their expression has been profiled on a genome-wide basis, these RNAs have not been fully recorded and annotated in public databases. As far as we know, only six databases and one server currently provide lncRNA-related information: TAIR (Swarbreck et al., 2008), PlantNATsDB (Chen et al., 2012), lncRNAdb (Amaral et al., 2011), NRED (Dinger et al., 2009), ncRNAimprint (Zhang et al., 2010), NONCODE (Bu et al., 2012) and ncFANs (Liao et al., 2011b). Among them, only PlantNATsDB (Chen et al., 2012) is designed for querying NAT pairs; however, this database merely lists NAT pairs and does not provide a genome view. The other six resources are not specifically designed for plant lncRNAs (Table 1). Therefore, a database that contains comprehensive information related to lncRNAs, such as genomic information, expression profiles, siRNA information and associated epigenetic markers, is warranted. Here, we develop an online database for plant lncRNAs, named PLncDB (Plant long non-coding RNA database), with the aim of providing comprehensive information on plant lncRNAs. Table 1 compares the information content of our database with that of the others.
AIMS OF DATABASE

Recent studies in mammalian genomes have shown that lncRNAs are generally characterized by four interesting features: (i) eukaryotic genomes encode a few thousand lincRNAs (Cabili et al., 2011; Dinger et al., 2008; Guttman et al., 2009); (ii) lncRNA genes are expressed in a temporally and/or spatially specific manner (Dinger et al., 2008; Managadze et al., 2011); (iii) genomic loci encoding lncRNAs are associated with epigenetic markers (Guttman et al., 2009; Khalil et al., 2009); and (iv) double-stranded structures formed by sense and antisense transcripts may be processed into siRNAs (Zhang et al., 2012).

Based on these characteristics of lncRNAs, PLncDB aims to provide the following four essential functions: (i) a collection and integration of lncRNAs from different data resources; (ii) lncRNA expression levels in various samples including different tissues, developmental stages, mutants and stress treatments; (iii) epigenetic modifications (e.g. DNA methylation and histone modifications) on lncRNA-encoding loci and their flanking genomic regions; and (iv) a collection of siRNA sequencing datasets across the whole genome (Table 2).

[Table 2. Column headings: Organism, Dataset, Number, Description, Source. Entries in extraction order: Expression; Tiling, lincRNA arrays; RNA-seq; lncRNA (RepTAS); Flower/root/leaf; lincRNA; Pri-miRNAs; Protein-coding gene; DNA methylation; Met1/DDC; Protein-coding gene; Small RNA; Histone modification; Dataset1 (WT and VIP3); H3K27me3/2; H3K36me2/H3K4me3; Dataset2 (WT and Met1); 4915; 3718; Tiling array; Tiling array; Genome; Genome; lincRNA array; TAIR 10; siRNA sequence; Oh et al., 2008; Zhang et al., 2009; Tiling (Matsui et al., 2008).]

DATABASE ACCESS

We constructed a genome browser database using the open source GBrowse library (Stein et al., 2002) to integrate and visualize these different data sources in PLncDB. For Arabidopsis, we also provide updated information from TAIR10 on genomic context, alignment information, protein-coding gene annotation and known ncRNAs.
For lncRNA expression information, we adopted the BigWig file format (Kent et al., 2010) to speed up querying. The database can be accessed or queried in various ways. Simply by clicking on a specific lncRNA, one can visualize the related mutant/stress information (Fig. 1). Specific searches can be performed using the name or keywords of a gene or protein and/or a location on the chromosome. The entire database is also available for download in different formats on the website.

Fig. 1. Snapshot of PLncDB

FUNCTION OF THE DATABASE

An online database to deposit, browse and download information relating to a large number of lncRNAs

We collected a total of 16 227 Arabidopsis lncRNAs from various resources published in the past decade (Liu et al., 2012). These lncRNAs were identified based on different versions of the genome sequence and were annotated separately using different criteria. With our Reproducibility-based Tiling array Analysis Strategy (RepTAS) method, 13 466 transcript units (TUs) were identified (Liu et al., 2012). To provide uniform and comprehensive information for Arabidopsis lncRNAs, we compared the genomic loci of TUs with exons, pseudogenes, repeat sequences and transposable elements annotated in TAIR10 and reclassified the remaining TUs into the following six categories: (i) TUs encoding NATs; (ii) Repeat-Containing TUs; (iii) Gene-Associated TUs; (iv) TUs encoding transcripts with long open reading frames suggesting novel protein-coding genes, also named TUs of Unknown Coding Potential (Cabili et al., 2011); (v) TUs for lincRNAs; and (vi) Other Intergenic TUs. Using RepTAS, we recently identified 6480 genes encoding lincRNAs (Liu et al., 2012).

An online genome browser to show lncRNA expression of various transcriptome data

An interesting feature of lncRNAs is their significantly tissue-specific expression pattern compared with mRNAs (Cabili et al., 2011; Dinger et al., 2008).
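The reclassification of TUs against TAIR10 annotations described in the previous subsection rests on comparing genomic intervals. The Python sketch below illustrates that kind of comparison under loudly stated assumptions: the `Interval` type, the `kind` labels, the precedence of the rules and the four simplified output labels are all hypothetical and do not reproduce the six categories or the actual criteria, which are not given here. Coding-potential assessment is omitted entirely, and the coordinates are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    strand: str         # "+" or "-"
    kind: str = "TU"    # hypothetical feature label, e.g. "exon", "repeat"


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def classify_tu(tu: Interval, features: List[Interval]) -> str:
    """Assign a simplified label to a TU (assumed rules, for illustration only)."""
    hits = [f for f in features if overlaps(tu, f)]
    # Assumed rule 1: overlap with an exon on the opposite strand marks a NAT.
    if any(f.kind == "exon" and f.strand != tu.strand for f in hits):
        return "NAT"
    # Assumed rule 2: any overlap with a repeat or transposable element.
    if any(f.kind in ("repeat", "transposable_element") for f in hits):
        return "repeat-containing"
    # Assumed rule 3: same-strand overlap with an exon, or overlap with a pseudogene.
    if any(f.kind in ("exon", "pseudogene") for f in hits):
        return "gene-associated"
    return "intergenic"


# Made-up annotation and TUs, not taken from TAIR10 or PLncDB.
annotation = [
    Interval("Chr1", 100, 500, "+", "exon"),
    Interval("Chr1", 1000, 1200, "+", "repeat"),
]
print(classify_tu(Interval("Chr1", 400, 800, "-"), annotation))
print(classify_tu(Interval("Chr1", 2000, 2500, "+"), annotation))
```

A linear scan over all features is enough for a sketch; for genome-scale annotation an interval index would normally be used instead.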
In our previous study, we profiled the transcriptome of Arabidopsis lncRNAs using multiple platforms, including high-throughput RNA-seq, tiling arrays and a custom-designed long oligonucleotide microarray (Liu et al., 2012). Dynamic expression patterns of thousands of lncRNAs were found across a number of tissues, developmental time-points, biotic/abiotic stress conditions and Arabidopsis mutants deficient in several RNA-interacting proteins. It is reasonable to hypothesize that the expression specificity of lncRNAs implies biological functions. Therefore, an online genome browser with a large set of transcriptome data will be useful to biologists investigating the functional roles of the lincRNAs. In this study, we collected datasets from three transcriptome detection platforms and analyzed their data quality. The raw datasets of the selected samples were normalized and re-analyzed using a uniform analysis protocol. We then integrated the processed signal intensities into the genome browser (Fig. 1). The current version of PLncDB is version 1.

Association of lncRNA-encoding genomic loci with epigenetic markers

Expression of thousands of lncRNAs is associated with epigenetic regulation. Yet, little is known about the epigenetic regulation of lncRNA expression in plants. Two strategies are widely used to profile epigenetic regulation genome-wide: (i) detection of epigenetic modification sites or modifier binding sites using immunoprecipitation followed by high-throughput detection, such as chromatin immunoprecipitation on chip (ChIP-chip) and ChIP-sequencing (ChIP-seq); and (ii) profiling of transcriptome changes in mutants deficient in epigenetic modifiers. In our study, we analyzed ChIP-chip data for the following histone modifications: H3K27me3, H3K4me3, H3K36me3 and H3K9me3. In addition, we profiled expression changes of lncRNAs in a number of RdDM-related mutants (RDD, DCL1/2/3/4, AGO4, RDR2 and DMS1) (Fig. 2).
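One common way to summarize an expression change between a mutant and the wild type is a log2 fold change. The Python sketch below shows that calculation as an illustration only. The text does not state which measure was used, so the formula, the pseudocount of 1, the threshold and the function names are assumptions, and the sample names and signal values are made up.

```python
import math
from typing import Dict, Tuple


def log2_fold_change(mutant: float, wild_type: float, pseudocount: float = 1.0) -> float:
    """log2((mutant + p) / (wild_type + p)); the pseudocount avoids division by zero."""
    return math.log2((mutant + pseudocount) / (wild_type + pseudocount))


def changed_lncrnas(
    signals: Dict[str, Tuple[float, float]], threshold: float = 1.0
) -> Dict[str, float]:
    """Keep lncRNAs whose absolute log2 fold change is at least `threshold`.

    `signals` maps a lncRNA name to (wild_type_signal, mutant_signal).
    """
    result = {}
    for name, (wild_type, mutant) in signals.items():
        lfc = log2_fold_change(mutant, wild_type)
        if abs(lfc) >= threshold:
            result[name] = lfc
    return result


# Hypothetical normalized signals: (wild type, mutant).
example = {
    "lncRNA_A": (1.0, 7.0),   # up in the mutant
    "lncRNA_B": (3.0, 3.0),   # unchanged
    "lncRNA_C": (3.0, 0.0),   # down in the mutant
}
print(changed_lncrnas(example))
```

With the default threshold of 1.0 the function keeps lncRNAs whose signal at least doubles or halves after the pseudocount is added; in practice the threshold and any replicate-based statistics would depend on the platform.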
These results have been integrated into the genome browser of PLncDB for public access (Figs 1 and 2).

Fig. 2. An example: detailed expression for At5NC081830

Sense and antisense double strands are precursors of siRNA

Sense and antisense transcripts may form double-stranded RNAs that are subsequently processed by the RNA interference machinery into siRNAs (Zhang et al., 2012). A few so-called nat-siRNAs have been reported in plants, mammals, Drosophila and yeasts. However, many questions remain regarding the features and biogenesis of nat-siRNAs (Luo et al., 2009; Wang et al., 2006; Zhang et al., 2012). For this reason, we also included our previous small RNA sequence dataset in this database (Wang et al., 2011).

FUTURE DIRECTION

In future, we plan to include other lncRNA datasets in our database, such as those for intronic non-coding RNAs (incRNAs) and natural antisense transcripts (NATs). Furthermore, information on lncRNAs of other plant species, e.g. rice and corn, will be included as the data become available.

We thank everyone who contributed to this work.

Funding: This work was supported by Singapore Ministry of Education Tier-2 grant MOE2009-T2-2-004 to L.S.W. and NIH GM44640 and the Cooperative Research Program for Agriculture Science & Technology Development (Project No. PJ906910) Rural Development Administration, Republic of Korea to N.-H.C.

Conflict of Interest: none declared.



Jingjing Jin, Jun Liu, Huan Wang, Limsoon Wong, Nam-Hai Chua. PLncDB: plant long non-coding RNA database, Bioinformatics, 2013, 1068-1071, DOI: 10.1093/bioinformatics/btt107