SEQADAPT: an adaptable system for the tracking, storage and analysis of high throughput sequencing experiments
David B Burdick, Chris C Cavnor, Jeremy Handcock, Sarah Killcoyne, Jake Lin, Bruz Marzolf, Stephen A Ramsey, Hector Rovira, Ryan Bressler, Ilya Shmulevich, John Boyle
Institute for Systems Biology, 1441 North 34th Street, Seattle, WA 98103, USA
Background: High throughput sequencing has become an increasingly important tool for biological research. However, the existing software systems for managing and processing these data have not provided the flexible infrastructure that research requires.
Results: Existing software solutions provide static and well-established algorithms in a restrictive package. However, as high throughput sequencing is a rapidly evolving field, such static approaches cannot readily adopt the latest advances and techniques that researchers often require. We have used a loosely coupled, service-oriented infrastructure to develop SeqAdapt. This system streamlines data management and allows for rapid integration of novel algorithms. Our approach also allows computational biologists to focus on developing and applying new methods instead of writing boilerplate infrastructure code.
Conclusion: The system is based around the Addama service architecture and is available at our website as a demonstration web application, as an installable single download, and as a collection of individual customizable services.
Background
This paper introduces a flexible and loosely coupled
data management system for high throughput
sequencing experiments. The system is designed to meet the
challenges of research, and is needed because the versatility
and applicability of high throughput sequencing
experiments are growing rapidly. The system can be overlaid on
top of existing software, and can be used to integrate
different specialized algorithms.
There already exist a number of commercial solutions
(Geospiza's GeneSifter [1], Genomatix Genome Analyzer
[2,3]) and non-commercial solutions (Galaxy [4],
CisGenome [5], ChIP-Seq Analysis Server [6]) for the
management and analysis of high throughput sequencing
information. The main drawback to these solutions is
that they focus on providing static, one-stop-shop
solutions that are designed to fit known markets using
well-established methods. While these static systems are
useful for non-technical researchers in a production
science environment, they lack flexibility for the
research scientist who wishes to use cutting edge
methods and tools.
The existing systems tend to focus on well-established
applications for high throughput sequencing:
experiments where the technology is seen as a more accurate
digital equivalent to microarrays (e.g. RNA-Seq),
experiments to determine protein binding (e.g.
ChIP-Seq), or large scale genome assembly projects. However,
high throughput sequencing has the potential of
becoming ubiquitous across many avenues of investigation.
This potential is due to both an increase in our
understanding of systems biology and the capabilities of the
new generation of instruments. As the field is constantly
evolving, new discoveries are continually being made,
including new medically related functionality of small
RNAs [7], new families of RNA [8], and signaling through
extra-cellular RNAs [9]. New techniques and instruments
are also being developed that provide insight into these
new facets, due to an increase in throughput (e.g.
multiplexing [10,11] and long reads [12]) and
sophistication (e.g. BS-Seq and targeted approaches). For these
reasons, any sequencing software infrastructure used in the
research environment must be easily adaptable. By this
we mean that it can be readily changed
for new uses. For example, we can expect each research
area to require different mechanisms for normalization
and replication strategies, sample and experiment
vocabularies, and analysis algorithms. Generally, each
research project requires a large amount of de novo
analysis development and customization to support new
technology strategies, such as allowing for multiplexing or
integrating with new instrumentation; informatics
strategies, to allow for data and system integration; and
computational strategies, to support analysis and
data-mining tasks. Additionally, each laboratory will have its
own demands in terms of experiment QA, annotations,
integration with existing processes (e.g. preferred desktop
analysis tools), and integration with other data types.
Therefore, it is important that the research community
have access to a system that is:
Open. The system must be distributed as an open
software project as many users will need to modify the
system to meet their specific needs.
Standardized. The system should follow widely used
standards for both software development and data
exchange. This will ensure that the code base will be
easier to maintain and have greater connectivity with
external systems and tools.
Adaptable. The system must be easily adaptable
without requiring a detailed understanding of the
internal software architecture. In this way,
significant modifications can be implemented efficiently
and quickly.
Deployable. The system must be quick and easy to
deploy and modify. A system that is cumbersome or
overly complex wastes the end user's development time
with unnecessary setup and technical details.
SeqAdapt follows these principles, and provides a
standardized and modular architecture which is easy to
use, adapt and maintain. The underlying enterprise
architecture, Addama [13], has been designed to provide
the adaptability required to enable the rapid
development needed within research-driven science.
Implementation
To meet the demands of researchers we have developed
SeqAdapt, a solution that is able to: scale to meet the
requirements of the research environment, use best
practices for mainstay applications (e.g. ChIP-Seq), and
be readily adapted to new usage.
The system is built using a general software
infrastructure to support Adaptable Data Management
(Addama). SeqAdapt integrates external sample tracking
software (e.g. SLIMseq [14]), workflows for executing
analyses (e.g. the MACS algorithm [15]) and robust
data management (e.g. JCR) to provide a modular and
adaptable system for high throughput sequencing
experiments.
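As an illustration of the modular design described above, the following minimal Python sketch shows one way that analysis steps could be registered by name and composed into a workflow without changing the surrounding infrastructure. It is not the SeqAdapt or Addama API: the names `register`, `run_pipeline`, `read_count` and `gc_content` are hypothetical, and the two toy analyses stand in for real algorithms such as MACS.

```python
from typing import Callable, Dict, List

# Hypothetical types and names for illustration only; this is not the
# SeqAdapt or Addama interface.
Reads = List[str]
Analysis = Callable[[Reads], dict]

_REGISTRY: Dict[str, Analysis] = {}


def register(name: str) -> Callable[[Analysis], Analysis]:
    """Return a decorator that adds an analysis step to the registry."""
    def wrap(fn: Analysis) -> Analysis:
        if name in _REGISTRY:
            raise ValueError(f"analysis {name!r} is already registered")
        _REGISTRY[name] = fn
        return fn
    return wrap


@register("read_count")
def read_count(reads: Reads) -> dict:
    # Toy analysis: number of reads in the input.
    return {"reads": len(reads)}


@register("gc_content")
def gc_content(reads: Reads) -> dict:
    # Toy analysis: fraction of G and C bases across all reads.
    total = sum(len(r) for r in reads)
    gc = sum(r.count("G") + r.count("C") for r in reads)
    return {"gc_fraction": gc / total if total else 0.0}


def run_pipeline(reads: Reads, steps: List[str]) -> dict:
    """Run the named steps in order and collect their results by step name."""
    return {step: _REGISTRY[step](reads) for step in steps}


if __name__ == "__main__":
    print(run_pipeline(["GGCC", "ATAT"], ["read_count", "gc_content"]))
```

In a sketch of this kind, adding a new method only requires defining a function and registering it under a new name; the workflow code that selects steps by name is left unchanged.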
Due to the data volumes involved with high
throughput sequencing, a software infrastructure is often
required to facilitate storage, management and analysis.
We have used the Addama system to provide the
necessary support for the creation of a workflow
encompassing the entire process (see Figure 1) that is complete,
lightweight and easily adapted to changing requirements.
This solution allows for changes in the underlying
sequencing technology while still providing the ability to
plug in new processing methods. A pluggable
architecture is important as the technology, data formats, and
processing methods are changing rapidly in the field of
sequencing. Performance and the ability to scale up as
datasets grow (...truncated)