The i5k Workspace@NAL—enabling genomic data access, visualization and curation of arthropod genomes
Monica Poelchau [2], Christopher Childers [2], Gary Moore [2], Vijaya Tsavatapalli [2], Jay Evans [1], Chien-Yueh Lee [0, 2], Han Lin [0, 2], Jun-Wei Lin [2, 4], Kevin Hackett [3]
[0] Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei 10617, Taiwan
[1] Bee Research Laboratory, U.S. Department of Agriculture-Agricultural Research Service, Beltsville, MD 20705, USA
[2] National Agricultural Library, Beltsville, MD 20705, USA
[3] Crop Production and Protection, U.S. Department of Agriculture-Agricultural Research Service, Beltsville, MD 20705, USA
[4] Graduate Institute of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan
The 5000 arthropod genomes initiative (i5k) has tasked itself with coordinating the sequencing of 5000 insect or related arthropod genomes. The resulting influx of data, mostly from small research groups or communities with little bioinformatics experience, will require visualization, dissemination and curation, preferably from a centralized platform. The National Agricultural Library (NAL) has implemented the i5k Workspace@NAL (http://i5k.nal.usda.gov/) to help meet the i5k initiative's genome hosting needs. Any i5k member is encouraged to contact the i5k Workspace with their genome project details. Once submitted, new content will be accessible via organism pages, genome browsers and BLAST search engines, which are implemented via the open-source Tripal framework, a web interface for the underlying Chado database schema. We also implement the Web Apollo software for groups that choose to curate gene models. New content will add to the existing body of 35 arthropod species, which include species relevant for many aspects of arthropod genomic research, including agriculture, invasion biology, systematics, ecology and evolution, and developmental research.
INTRODUCTION
Insects are an incredibly diverse class, with over 1
million described species. They provide essential pollination
services for agriculture (1,2), yet cause substantial
damage to crops (3), and are vectors of devastating diseases
(4). Further, they are important ecological, evolutionary,
developmental and medical models. The genomes of
insects sequenced to date have already provided important
insights into genome architecture and evolution (5–7),
immune response pathways (8), eusociality (9,10) and
speciation (11,12). Sequencing further genomes brings great
promise to answer pending questions for basic and applied
biology.
Decreasing whole-genome sequencing costs, coupled
with moderate genome sizes for insects and their relatives,
favor new genome projects for this group. One challenge
for arthropod comparative genomic research is the deep
divergence between arthropod groups, often spanning
hundreds of millions of years. Comparative analyses would
therefore benefit tremendously from sequence information
across the breadth of Arthropoda. The 5000 arthropod
genomes initiative (i5k) has therefore set the goal to
coordinate the sequencing of 5000 insect or related arthropod
species (13,14). As such, the i5k initiative should galvanize
the generation of large amounts of data of exceptional
comparative value. While the i5k initiative will provide guidance
on the sequencing of genomes, the onus is still on
individual labs with a specific interest in these genomes to
organize the sequencing, analysis and curation of their genome
projects. To ensure wide re-use of genomic data resulting
from the i5k project, it is important that this data is hosted
in a centralized location for data sharing, dissemination,
visualization and curation. However, developing, deploying
and maintaining databases and web servers for genome
access and curation (genome portals) is often beyond the
financial and technical reach of smaller genome projects.
Here, we introduce the i5k Workspace@NAL (https://i5k.nal.usda.gov),
which the National Agricultural Library
(NAL) designed to meet the genome hosting needs of the
i5k community. The i5k Workspace@NAL has two main
goals. First, it aims to help the i5k data producers, in
particular orphaned groups without the technical or financial
means for genome hosting, at the interface of sequence
retrieval and analysis, i.e. how do you access, visualize,
curate and disseminate your data once you have received it
from the sequencing center? Second, we aim to provide a
unified framework for data consumers to retrieve relevant
genomic information from our data providers. We outline
the i5k Workspace, and explain the steps that we take to help
i5k data producers disseminate and curate their genome
assemblies, and how we present this content to the i5k data
consumers. Finally, we explore the future directions the i5k
Workspace@NAL will take to continue improving its
services for the i5k community, and the arthropod genomics
community at large.
USING THE i5k WORKSPACE@NAL
Data producers
Setting up your genome portal. The i5k@NAL hosts
genome assemblies for any arthropod genome project that
requires our services. There is no hosting preference for
a particular taxonomic group or application, but we ask
that no agreements with other genome portals be in place,
to avoid redundant hosting and curation efforts. Our only
requirement is a genome assembly, preferably approved
by NCBI (the U.S. National Institutes of Health National
Center for Biotechnology Information) or other INSDC
(International Nucleotide Sequence Database
Collaboration) members, in FASTA format (scaffold, contig and
.agp mapping files). Any sequence features that have been
mapped to this assembly can also be submitted, including
but not limited to official gene sets (OGS), other consensus
gene sets or gene predictions, homology alignments,
RNA-Seq mappings and transcriptomes (Figure 1). We provide
a tutorial on how to map RNA-Seq reads to a genome
in iPlant
(https://i5k.nal.usda.gov/content/performing-rnaseq-alignments-iplant-baylor-i5k-pilot), and have
developed an extension of the exonerate alignment
program (15) that generates GFF3-formatted output
(https://github.com/hotdogee/exonerate-gff3), allowing users to
map moderately-sized transcriptomes against a genome
assembly. We will also transform BAM files to BigWig format
in-house if requested. We can advise on what files would
be useful to visualize, given the genome community's
specific needs. Finally, we ask genome communities to
provide us with information to populate the organism's
landing page, as well as metadata about each file given to us
to communicate to other users of the data. Data can be
transferred to us via FTP or iPlant
(https://i5k.nal.usda.gov/content/sharing-files-us). We recommend that each genome
community designate a community contact to serve as the
main contact for data files, as well as mediate the manual
curation process if this is to be a part of the genome project.
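As an illustration of the kind of consistency that makes mapped features usable alongside an assembly, the following minimal sketch checks that every sequence identifier in a GFF3 feature file also appears as a header in the assembly FASTA file. It is not an i5k Workspace tool; the function names are hypothetical, and it assumes plain-text (uncompressed) FASTA and tab-delimited GFF3 input.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first word after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gff3_seqids(lines):
    """Collect column-1 sequence IDs from GFF3 feature lines."""
    seqids = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no feature lines follow
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        seqids.add(line.split("\t", 1)[0])
    return seqids


def unknown_seqids(fasta_lines, gff3_lines):
    """Return GFF3 sequence IDs that are absent from the assembly, sorted."""
    return sorted(gff3_seqids(gff3_lines) - fasta_ids(fasta_lines))


if __name__ == "__main__":
    # Tiny in-memory example; real use would pass open file handles.
    assembly = [">Scaffold1 length=8", "ACGTACGT", ">Scaffold2", "GGCC"]
    features = [
        "##gff-version 3",
        "Scaffold1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1",
        "Scaffold3\t.\tgene\t1\t4\t.\t-\t.\tID=gene2",
    ]
    print(unknown_seqids(assembly, features))
```

Because the functions accept any iterable of lines, they can be given open file handles directly, e.g. `unknown_seqids(open("assembly.fa"), open("genes.gff3"))`, where both file names are placeholders. An empty result means every feature refers to a sequence present in the assembly.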
Data processing: what we do with your data. For each
new organism and genome assembly, we generate
customized organism landing pages ( (...truncated)