The i5k Workspace@NAL—enabling genomic data access, visualization and curation of arthropod genomes

Nucleic Acids Research, Jan 2015

The 5000 arthropod genomes initiative (i5k) has tasked itself with coordinating the sequencing of 5000 insect or related arthropod genomes. The resulting influx of data, mostly from small research groups or communities with little bioinformatics experience, will require visualization, dissemination and curation, preferably from a centralized platform. The National Agricultural Library (NAL) has implemented the i5k Workspace@NAL (http://i5k.nal.usda.gov/) to help meet the i5k initiative's genome hosting needs. Any i5k member is encouraged to contact the i5k Workspace with their genome project details. Once submitted, new content will be accessible via organism pages, genome browsers and BLAST search engines, which are implemented via the open-source Tripal framework, a web interface for the underlying Chado database schema. We also implement the Web Apollo software for groups that choose to curate gene models. New content will add to the existing body of 35 arthropod species, which include species relevant for many aspects of arthropod genomic research, including agriculture, invasion biology, systematics, ecology and evolution, and developmental research.



Monica Poelchau (2), Christopher Childers (2), Gary Moore (2), Vijaya Tsavatapalli (2), Jay Evans (1), Chien-Yueh Lee (0,2), Han Lin (0,2), Jun-Wei Lin (2,4), Kevin Hackett (3)

0 Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei 10617, Taiwan
1 Bee Research Laboratory, U.S. Department of Agriculture-Agricultural Research Service, Beltsville, MD 20705, USA
2 National Agricultural Library, Beltsville, MD 20705, USA
3 Crop Production and Protection, U.S. Department of Agriculture-Agricultural Research Service, Beltsville, MD 20705, USA
4 Graduate Institute of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan

INTRODUCTION

Insects are an incredibly diverse class, with over 1 million described species.
They provide essential pollination services for agriculture (1,2), yet cause substantial damage to crops (3), and are vectors of devastating diseases (4). Further, they are important ecological, evolutionary, developmental and medical models. The genomes of insects sequenced to date have already provided important insights into genome architecture and evolution (5–7), immune response pathways (8), eusociality (9,10) and speciation (11,12). Sequencing further genomes promises to answer pending questions in basic and applied biology. Decreasing whole-genome sequencing costs, coupled with moderate genome sizes for insects and their relatives, favor new genome projects for this group. One downside for arthropod comparative genomic research is the great divergence times between arthropod groups, often in the hundreds of millions of years. Comparative analyses would therefore benefit tremendously from sequence information across the breadth of Arthropoda. The 5000 arthropod genomes initiative (i5k) has therefore set the goal to coordinate the sequencing of 5000 insect or related arthropod species (13,14). As such, the i5k initiative should galvanize the generation of large amounts of data of exceptional comparative value.

While the i5k initiative will provide guidance on the sequencing of genomes, the onus is still on individual labs with a specific interest in these genomes to organize the sequencing, analysis and curation of their genome projects. To ensure wide re-use of genomic data resulting from the i5k project, it is important that these data are hosted in a centralized location for data sharing, dissemination, visualization and curation. However, developing, deploying and maintaining databases and web servers for genome access and curation (genome portals) is often beyond the financial and technical reach of smaller genome projects.

Here, we introduce the i5k Workspace@NAL (https://i5k.nal.usda.gov), which the National Agricultural Library (NAL) designed to meet the genome hosting needs of the i5k community. The i5k Workspace@NAL has two main goals. First, it aims to help the i5k data producers, in particular orphaned groups without the technical or financial means for genome hosting, at the interface of sequence retrieval and analysis, i.e. how do you access, visualize, curate and disseminate your data once you have received it from the sequencing center? Second, we aim to provide a unified framework for data consumers to retrieve relevant genomic information from our data providers. We outline the i5k Workspace, and explain the steps that we take to help i5k data producers disseminate and curate their genome assemblies, and how we present this content to the i5k data consumers. Finally, we explore the future directions the i5k Workspace@NAL will take to continue improving its services for the i5k community, and the arthropod genomics community at large.

USING THE i5k WORKSPACE@NAL

Data producers

Setting up your genome portal. The i5k@NAL hosts genome assemblies for any arthropod genome project that requires our services. There is no hosting preference for a particular taxonomic group or application, but we ask that no agreements with other genome portals be in place, to avoid redundant hosting and curation efforts. Our only requirement is a genome assembly, preferably approved by NCBI (the U.S. National Institutes of Health National Center for Biotechnology Information) or other INSDC (International Nucleotide Sequence Database Collaboration) members, in FASTA format (scaffold, contig and .agp mapping files). Any sequence features that have been mapped to this assembly can also be submitted, including but not limited to official gene sets (OGS), other consensus gene sets or gene predictions, homology alignments, RNA-Seq mappings and transcriptomes (Figure 1).
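
Because the submission consists of scaffold and contig FASTA files plus an .agp mapping file, a data producer may want to confirm that these files agree with each other before transferring them. The following is a minimal sketch of such a check and is not part of the i5k Workspace tooling; the function names are hypothetical. It assumes a tab-delimited AGP file whose first column is the scaffold (object) identifier and whose third column is the object end coordinate, and it assumes that the FASTA identifier is the first whitespace-delimited token of each header line.

```python
import io


def read_fasta_lengths(handle):
    """Return {sequence_id: length} for every record in a FASTA stream."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the identifier is the first token of the header.
            current = line[1:].split()[0]
            if current in lengths:
                raise ValueError(f"duplicate FASTA identifier: {current}")
            lengths[current] = 0
        elif current is None:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            lengths[current] += len(line)
    return lengths


def read_agp_object_ends(handle):
    """Return {object_id: largest object_end coordinate} from an AGP stream."""
    ends = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            raise ValueError(f"AGP line has fewer than 3 columns: {line!r}")
        obj, obj_end = fields[0], int(fields[2])
        ends[obj] = max(ends.get(obj, 0), obj_end)
    return ends


def check_assembly(scaffold_fasta, agp):
    """List disagreements between a scaffold FASTA stream and an AGP stream."""
    lengths = read_fasta_lengths(scaffold_fasta)
    ends = read_agp_object_ends(agp)
    problems = []
    for seq_id, length in lengths.items():
        if seq_id not in ends:
            problems.append(f"{seq_id}: in FASTA but not in AGP")
        elif ends[seq_id] != length:
            problems.append(
                f"{seq_id}: FASTA length {length} does not match AGP end {ends[seq_id]}"
            )
    for obj in ends:
        if obj not in lengths:
            problems.append(f"{obj}: in AGP but not in FASTA")
    return problems


if __name__ == "__main__":
    # Tiny in-memory example; real use would pass open file handles instead.
    fasta = io.StringIO(">scaffold1\nACGTNNACGT\n")
    agp = io.StringIO(
        "scaffold1\t1\t4\t1\tW\tcontig1\t1\t4\t+\n"
        "scaffold1\t5\t6\t2\tN\t2\tscaffold\tyes\tpaired-ends\n"
        "scaffold1\t7\t10\t3\tW\tcontig2\t1\t4\t+\n"
    )
    print(check_assembly(fasta, agp))
```

An empty list means every scaffold in the FASTA file has a matching AGP object of the same length. The check is deliberately shallow: it does not validate contig coordinates, gap types or orientation, which a dedicated AGP validator would cover.
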
We provide a tutorial on how to map RNA-Seq reads to a genome in iPlant (https://i5k.nal.usda.gov/content/performing-rnaseq-alignments-iplant-baylor-i5k-pilot), and have developed an extension of the exonerate alignment program (15) that generates gff3-formatted output (https://github.com/hotdogee/exonerate-gff3), allowing users to map moderately sized transcriptomes against a genome assembly. We will also transform BAM files to BigWig format in-house if requested. We can advise on what files would be useful to visualize, given the genome community's specific needs. Finally, we ask genome communities to provide us with information to populate the organism's landing page, as well as metadata about each file given to us, which we communicate to other users of the data. Data can be transferred to us via FTP or iPlant (https://i5k.nal.usda.gov/content/sharing-files-us). We recommend that each genome community designate a community contact to serve as the main contact for data files, and to mediate the manual curation process if this is to be a part of the genome project.

Data processing: what we do with your data. For each new organism and genome assembly, we generate customized organism landing pages ( (...truncated)


This is a preview of a remote PDF: https://nar.oxfordjournals.org/content/43/D1/D714.full.pdf
Article home page: http://nar.oxfordjournals.org/content/43/D1/D714.abstract

Monica Poelchau, Christopher Childers, Gary Moore, Vijaya Tsavatapalli, Jay Evans, Chien-Yueh Lee, Han Lin, Jun-Wei Lin, Kevin Hackett. The i5k Workspace@NAL—enabling genomic data access, visualization and curation of arthropod genomes, Nucleic Acids Research, 2015, pp. D714-D719, 43/D1, DOI: 10.1093/nar/gku983