LncRNAWiki: harnessing community knowledge in collaborative curation of human long non-coding RNAs

Nucleic Acids Research, Jan 2015

Long non-coding RNAs (lncRNAs) perform a diversity of functions in numerous important biological processes and are implicated in many human diseases. In this report we present lncRNAWiki (http://lncrna.big.ac.cn), a wiki-based platform that is open-content and publicly editable and aimed at community-based curation and collection of information on human lncRNAs. Current related databases depend primarily on curation by experts, which makes it laborious to annotate the exponentially accumulating information on lncRNAs and calls for collective, community-based curation. Unlike existing databases, lncRNAWiki features comprehensive integration of information on human lncRNAs obtained from multiple different resources and allows not only existing lncRNAs to be edited, updated and curated by different users but also the addition of newly identified lncRNAs by any user. It harnesses collective community knowledge in collecting, editing and annotating human lncRNAs and rewards community curation efforts by providing explicit authorship based on quantified contributions. LncRNAWiki relies on the underlying knowledge of the scientific community for collective and collaborative curation of human lncRNAs and thus has the potential to serve as an up-to-date and comprehensive knowledgebase for human lncRNAs.


Lina Ma (2), Ang Li (2), Dong Zou (2), Xingjian Xu (1, 2), Lin Xia (1, 2), Jun Yu (2), Vladimir B. Bajic (0), Zhang Zhang (2)

0 Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
1 University of Chinese Academy of Sciences, Beijing 100049, China
2 CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China

In mammals, a small fraction of the genome (e.g. in human) is transcribed into messenger RNAs, whereas most of it represents transcribed dark matter that does not encode proteins. Among these non-coding transcripts, long non-coding RNAs (lncRNAs) are prevalently transcribed from mammalian genomes and are present in large amounts in mammalian cells (1–4). Evidence has accumulated that lncRNAs play significant roles in numerous fundamental biological processes such as transcription, translation, cell cycle, imprinting, splicing and protein localization (5–7), and are highly implicated in cancer progression (8–14) and in the development of many other human diseases such as Mendelian disorders, cardiovascular diseases and neurological disorders (14–17).

Advances in studies of non-coding RNA, and the consequently increasing number of identified lncRNAs, have resulted in the development of several lncRNA-related databases. Among them, GENCODE, which aims at annotating all functional elements in the human genome (4), has made a comprehensive annotation of the gene structure (gene loci, transcript loci, exon number and splicing boundary) of 23 898 human lncRNA transcripts (Version 19). NONCODE (18) collects 95 135 human lncRNA transcripts obtained from the published literature and databases (Version 4.0). LNCipedia (19) contains a total of 32 181 human lncRNA transcripts and incorporates related statistics such as protein-coding potential, secondary structure information and microRNA binding sites (Version 2.1). lncRNAdb (20) focuses on collecting functional annotations based on the published literature and, to date, includes only about 200 lncRNAs. Another database, Rfam, centers on non-coding RNA families and thus does not provide specialized information for an individual lncRNA (21).
Although these existing databases offer valuable information on different aspects of the lncRNA universe at different levels of coverage, a dedicated database for human lncRNAs is lacking that provides lncRNA transcript details coupled with the capability to update and curate data smoothly and easily. Specifically, existing databases are mostly dependent on expert curation, so keeping them comprehensively up to date with the fast-growing number of newly discovered lncRNAs is laborious.

Wikipedia (http://www.wikipedia.org), an online encyclopedia, is an extraordinarily successful example that relies on community knowledge for information integration and allows people from all over the world to create and edit any content. Wikipedia features collaborative information integration, huge coverage, up-to-date content and low maintenance cost. Many attempts have been made to apply wikis to biological data integration (22–25). For example, Rfam, dedicated to RNA families, has adopted wiki technology for community curation. Judging by its name, LNCipedia looks like a wiki resource, but in fact it is not fully open to the scientific community for data provision or editing. Considering the exponentially accumulating volume of lncRNAs, it is desirable to exploit the knowledge of the broad scientific community for collaborative integration and curation of lncRNA information (26–30).

Here we present lncRNAWiki (http://lncrna.big.ac.cn). This platform is wiki-based, open-content, publicly editable and aimed at community curation of human lncRNAs. Unlike existing relevant databases, lncRNAWiki features comprehensive integration of information on human lncRNAs, cataloging 105 255 non-redundant lncRNA transcripts obtained from multiple different resources.
Moreover, it harnesses collective knowledge for collecting, editing and annotating information on human lncRNAs and rewards community curation efforts by quantifying the contributions of users and providing explicit authorship based on these quantified contributions, with the aim of drawing on the knowledge of the broad scientific community for collaborative curation of human lncRNAs. Therefore, lncRNAWiki has the potential to serve as an up-to-date and comprehensive knowledgebase for human lncRNAs.

IMPLEMENTATION

LncRNAWiki is built on MediaWiki version 1.19.1 (http://www.mediawiki.org), an open source wiki engine; MySQL version 5.1.58 (http://www.mysql.org), a popular and free relational database management system; and PHP version 5.2.17 (http://www.php.net), a scripting language. These were implemented on a Red Hat Enterprise Linux Server. To make lncRNAWiki more attractive to participants from the broader scientific community in collaborative curation of lncRNAs, we installed AuthorReward (http://www.mediawiki.org/wiki/Extension:AuthorReward), an extension that adds customized functionality to MediaWiki (31). AuthorReward quantifies each participant's contribution by considering both edit quality and edit quantity, and provides explicit authorship based on these quantified contributions. It has been successfully demonstrated in RiceWiki (32), where it has attracted more than 800 participants in collaborative curation of 600 genes. All extensions and software implemented in lncRNAWiki are listed at http://lncrna.big.ac.cn/index.php/Special:Version. We also integrated JBrowse (version 1.11.4) (33,34) into lncRNAWiki to facilitate visualization of the genomic context and transcript structure of each lncRNA.

We integrated lncRNA sequences and annotation information (e.g. genomic location, transcript structure) from three data sources: GENCODE (version 19; 23 898 human lncRNA transcripts), NONCODE (version 4.0; 95 135 human lncRNA transcripts) and LNCipedia (version 2.1; 32 181 human lncRNA transcripts). A process of error and redundancy elimination was performed on the integrated data set. First, we removed sequences containing N (ambiguous bases) in each data source; as a result, a total of eight lncRNAs in LNCipedia were removed. Second, we excluded lncRNAs with an ambiguous naming scheme: within each data source, two or more lncRNA transcripts that have 100% sequence identity over the whole transcript length (based on blastn results) and occupy the same genomic location but have different IDs were considered questionable lncRNAs. Consequently, 14, 20 and 8 lncRNAs were removed from GENCODE, NONCODE and LNCipedia, respectively. Lastly, since different databases may have different naming schemes and a given lncRNA transcript may accordingly have different identifiers in different databases, we performed blastn across these three data sources. LncRNA transcripts having 100% sequence identity (based on blastn results) and occupying the same genomic location were regarded as the same lncRNA. Finally, we obtained a total of 105 255 non-redundant lncRNA transcripts (Figure 1).

We also ran BLAST with these 105 255 lncRNAs against the lncRNA sequences in lncRNAdb (223 lncRNAs in total as of July 21, 2014) and found that only 103 lncRNAs have been functionally annotated (Supplementary Table S1), indicating that a large number of human lncRNAs are poorly annotated and need a platform for community annotation. Based on our previous study (7) and the categories of Derrien et al. (35), we classified the 105 255 non-redundant lncRNA transcripts into seven groups according to their genomic location with respect to protein-coding genes, viz., Intergenic, Intronic (S), Intronic (AS), Overlapping (S), Overlapping (AS), Sense and Antisense (Figure 2).
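The three elimination steps described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the pipeline actually used: exact string equality stands in for the blastn 100% identity test, every member of a questionable group is dropped (the text does not say whether one copy was kept), and the `Transcript` record, its field names and the example IDs are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record; field names are illustrative only."""
    source: str          # e.g. "GENCODE", "NONCODE", "LNCipedia"
    transcript_id: str
    location: tuple      # (chromosome, start, end, strand)
    sequence: str


def drop_ambiguous(transcripts):
    """Step 1: discard sequences that contain the ambiguous base N."""
    return [t for t in transcripts if "N" not in t.sequence.upper()]


def drop_questionable(transcripts):
    """Step 2: within one source, drop transcripts that share sequence and
    location but carry different IDs (assumption: the whole group is dropped)."""
    groups = defaultdict(list)
    for t in transcripts:
        groups[(t.source, t.location, t.sequence.upper())].append(t)
    return [g[0] for g in groups.values()
            if len({t.transcript_id for t in g}) == 1]


def merge_across_sources(transcripts):
    """Step 3: treat transcripts with identical sequence and location as one
    lncRNA and remember every (source, ID) pair it is known by."""
    merged = defaultdict(list)
    for t in transcripts:
        merged[(t.location, t.sequence.upper())].append((t.source, t.transcript_id))
    return dict(merged)


def nonredundant(transcripts):
    """Run the three steps in the order given in the text."""
    return merge_across_sources(drop_questionable(drop_ambiguous(transcripts)))


# Made-up example records, not real database entries.
records = [
    Transcript("GENCODE", "G1", ("chr1", 100, 200, "+"), "ACGT"),
    Transcript("NONCODE", "N1", ("chr1", 100, 200, "+"), "acgt"),
    Transcript("LNCipedia", "L1", ("chr2", 5, 50, "-"), "ACNGT"),
    Transcript("NONCODE", "N2", ("chr4", 7, 70, "+"), "TTAA"),
    Transcript("NONCODE", "N3", ("chr4", 7, 70, "+"), "TTAA"),
    Transcript("LNCipedia", "L2", ("chr3", 1, 9, "+"), "GGCC"),
]
catalog = nonredundant(records)
```

In this toy input, `L1` is removed for containing N, `N2` and `N3` are removed as a questionable pair, and `G1` and `N1` collapse into a single entry whose list of (source, ID) pairs corresponds to the kind of cross-reference a "Same with" field would hold.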
The difference between our classification and that of Derrien et al. (35) is that we classified lncRNAs that intersect protein-coding genes into Sense or Antisense by considering the whole transcript sequence instead of the exonic region only.

DATABASE CONTENT

The central entities of lncRNAWiki are human lncRNA transcripts. Thus, each transcript has a corresponding wiki page (Figure 3). Transcripts are also grouped by classification category and by gene; this grouping can be accessed from the homepage and from the bottom of each transcript page. Transcript-specific pages are generated based on transcript identifiers. The content of every transcript page in lncRNAWiki is structured into two parts: the user-edit part and the Basic Information part. The user-edit part allows users to add or delete annotations. In contrast, the Basic Information part is organized as a table and will be regularly updated by the lncRNAWiki team based on annotation information integrated from multiple different lncRNA-associated sources.

The lncRNA information in lncRNAWiki was seeded from GENCODE, NONCODE and LNCipedia, yielding a comprehensive integration of 105 255 non-redundant lncRNA transcripts (Figure 1). Basic Information provides users with the basic details of an lncRNA, such as genomic location, transcript structure and sequence. There are 10 sub-sections in Basic Information: Transcript ID, Source, Same with, Classification, Length, Genomic location, Exon number, Exons, Genome context and Sequence. Transcript ID refers to the lncRNA ID in the data source. Source indicates the source database, and its version, from which the lncRNA was obtained. Same with provides the IDs of lncRNAs in other lncRNA sequence databases that are considered to be the same entry.
Because the genomic context of lncRNAs may offer insights into their function, the classification of lncRNAs based on genomic location is of great biological significance for in-depth mining and analysis (7). We classified lncRNAs into seven categories according to their genomic location with respect to protein-coding genes, i.e. Sense, Antisense, Overlapping (S), Overlapping (AS), Intergenic, Intronic (S) and Intronic (AS) (Figure 2A). In the present data set, the majority of human lncRNAs belong to the categories Intergenic (59.2%) and Sense (24.4%) (Figure 2B). The sub-section Genome context facilitates visual inspection of the transcript structure and genomic context.

The user-edit part includes three sections: Annotated Information, Labs working on this lncRNA and References (Figure 3). Annotated Information is presented as free text. This helps users who have no training in wiki techniques or curation to contribute edits and share knowledge; it simplifies editing significantly and lowers the technological barrier to participation for a wider community. Annotated Information links to several sub-sections, including Function, Expression, Regulation, Diseases and Evolution, so that users can be directed to the sub-section(s) of interest. Considering the importance and necessity of lncRNA nomenclature, we added the sub-section Transcriptomic Nomenclature. As most lncRNAs have not been functionally studied, we named lncRNAs based on their biological features, such as genomic location, alternative splicing and expression level, basically following the rules of HGNC (36) (Figure 3). Users can add new sub-sections if necessary, and can delete those that are irrelevant.
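The two-part page layout described above (a team-maintained Basic Information table plus freely editable sections) can be pictured as a small data structure. The following Python sketch is only an illustration: the class name `TranscriptPage`, its methods and the example ID are hypothetical, and the rule that the model rejects user edits to Basic Information is an assumption drawn from the statement that this part is updated by the lncRNAWiki team.

```python
from dataclasses import dataclass, field

# The 10 Basic Information sub-sections named in the text.
BASIC_FIELDS = (
    "Transcript ID", "Source", "Same with", "Classification", "Length",
    "Genomic location", "Exon number", "Exons", "Genome context", "Sequence",
)

# The three user-edit sections named in the text.
USER_SECTIONS = ("Annotated Information", "Labs working on this lncRNA", "References")


@dataclass
class TranscriptPage:
    """Hypothetical in-memory model of one transcript page."""
    basic: dict                                  # maintained by the lncRNAWiki team
    user: dict = field(
        default_factory=lambda: {name: "" for name in USER_SECTIONS}
    )

    def edit(self, section, text):
        """Set the free text of a user-edit section, creating it if new."""
        if section in BASIC_FIELDS:
            # Assumption: Basic Information is not changed through user edits.
            raise PermissionError("Basic Information is updated by the lncRNAWiki team")
        self.user[section] = text

    def remove_section(self, section):
        """Delete a user-edit section that is judged irrelevant."""
        if section in BASIC_FIELDS:
            raise PermissionError("Basic Information is updated by the lncRNAWiki team")
        self.user.pop(section, None)


# Made-up example; "EXAMPLE-001" is not a real identifier.
page = TranscriptPage(basic={"Transcript ID": "EXAMPLE-001", "Classification": "Intergenic"})
page.edit("Function", "Free-text description contributed by a community member.")
page.remove_section("Labs working on this lncRNA")
```

Keeping the curated fields and the community text in separate containers mirrors the split between regularly refreshed source data and user annotations, so a refresh of the former need not overwrite the latter.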
The sub-section layout enables intuitive editing of the information through an edit link available for each sub-section; edits can also be made through the application programming interface for automatic entry of information. Labs working on this lncRNA contains a worldwide list of laboratories that work on the lncRNA, thus facilitating collaboration and interaction in its curation. References lists publications related to the lncRNA, which are automatically formatted using the Cite extension (http://www.mediawiki.org/wiki/Extension:Cite).

In lncRNAWiki, community curation efforts are quantified and rewarded with explicit authorship, with the aim of encouraging more participants from the wider scientific community to take part in collective and collaborative curation of lncRNAs. On any given lncRNA page, the curation effort of each participating contributor is quantified as a contribution score, which evaluates both the quality and the quantity of edits; authorship is then awarded to any contributor whose contribution score is greater than a cutoff score (1 by default) (Figure 3). The top of each page displays brief authorship information, including contributor name(s), lncRNA ID, a hyperlink to the lncRNA and the last update time. Detailed authorship information is presented at the bottom of each page: a pie chart depicts the contribution scores of all involved contributors, a histogram illustrates the edit quality and quantity of each contributor, and a table lists contributor names with contribution score, edit count, edit quality, edit quantity, last edit time and edit details.

In addition, when a newly identified human lncRNA is reported, any user can create a new page to add specific information for this lncRNA, enabling lncRNAWiki to become an up-to-date and comprehensive knowledgebase for human lncRNAs.
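The authorship rule stated above (a contributor is credited when the contribution score exceeds a cutoff of 1 by default) can be sketched in a few lines of Python. This is not the AuthorReward implementation: the text does not give the formula that combines edit quality and quantity, so the scores are taken as given inputs, and the function names, the ordering of authors by descending score and the example contributors are all assumptions made for illustration.

```python
def award_authorship(scores, cutoff=1.0):
    """Return contributors whose score is strictly greater than the cutoff.

    Assumption: authors are listed from highest to lowest score, with ties
    broken alphabetically; the source does not specify an ordering.
    """
    qualified = [(name, score) for name, score in scores.items() if score > cutoff]
    qualified.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in qualified]


def score_shares(scores):
    """Fraction of the total contribution score held by each contributor,
    i.e. the proportions a pie chart of contribution scores would show."""
    total = sum(scores.values())
    if total == 0:
        return {}
    return {name: score / total for name, score in scores.items()}


# Made-up contribution scores for one lncRNA page.
page_scores = {"alice": 5.0, "bob": 1.0, "carol": 2.5}
authors = award_authorship(page_scores)
shares = score_shares(page_scores)
```

With these made-up scores, `bob` sits exactly at the default cutoff and is therefore not credited, because the rule requires a score greater than the cutoff rather than equal to it.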
As an example of adding a newly identified lncRNA, we added LUNAR1, a recently discovered human lncRNA, to lncRNAWiki (http://lncrna.big.ac.cn/index.php/LUNAR1) immediately after the relevant paper was published online. LUNAR1 is Notch-regulated and enhances mRNA expression of IGF1R (insulin-like growth factor 1 receptor), contributing to the maintenance of human T-cell acute lymphoblastic leukemia (T-ALL) (37).

LncRNAWiki is a wiki-based database dedicated to human lncRNAs. It comprehensively integrates information on human lncRNAs from multiple different resources and draws on the wide scientific community to collect, edit and annotate human lncRNAs. It allows not only existing lncRNAs to be edited, updated and curated by different users but also newly identified lncRNAs to be added by any user. As the number of lncRNAs is growing fast while the number of expert curators focused on lncRNAs remains small, lncRNAWiki has the potential to serve as an up-to-date and comprehensive knowledgebase for human lncRNAs. It should be noted, however, that lncRNAWiki does not aim to replace traditional expert-curated databases; rather, it is an important complement to them. LncRNAWiki relies on community intelligence to curate a wide range of lncRNA-related topics, which can significantly reduce the effort and time required of expert curators. With explicit authorship as a reward for community curation, lncRNAWiki holds the promise of attracting more people (especially field experts) to share their expertise and to provide edits on lncRNAs of interest to them. In addition, it is of great value for authors of recent publications to curate their newly reported lncRNAs and to submit functional descriptions to lncRNAWiki, facilitating information dissemination and maximizing the scope of knowledge sharing. Together, based on community curation, lncRNAWiki has the potential to grow into an lncRNA encyclopedia by the community, of the community and for the community.
Future directions for lncRNAWiki include integrating more types of data from different sources into the Basic Information section (e.g. expression level, tissue-specific expression, orthologs) and improving links to existing relevant databases, such as the lncRNA-associated disease database LncRNADisease (38) and the lncRNA expression database NRED (39). To facilitate the detection and annotation of lncRNAs, we will also integrate analysis tools for lncRNA detection and classification into lncRNAWiki and employ automatic text mining to aid lncRNA-related literature curation.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

We thank anonymous reviewers for their valuable comments and members of the Zhang Lab for reporting bugs and sending comments.

Strategic Priority Research Program of the Chinese Academy of Sciences [XDB13040500 to Z.Z.]; National Natural Science Foundation of China [31200978 to L.M.]; Base Research Fund of King Abdullah University of Science and Technology [to V.B.B.]; 100-Talent Program of Chinese Academy of Sciences [Y1SLXb1365 to Z.Z.]. Funding for open access charge: Strategic Priority Research Program of the Chinese Academy of Sciences [XDB13040000].

Conflict of interest statement. None declared.


Lina Ma, Ang Li, Dong Zou, Xingjian Xu, Lin Xia, Jun Yu, Vladimir B. Bajic, Zhang Zhang. LncRNAWiki: harnessing community knowledge in collaborative curation of human long non-coding RNAs, Nucleic Acids Research, 2015, D187-D192, DOI: 10.1093/nar/gku1167