Adaptable data management for systems biology investigations
John Boyle, Hector Rovira, Chris Cavnor, David Burdick, Sarah Killcoyne, Ilya Shmulevich
Address: Institute for Systems Biology, 1441 N 34th Street, Seattle, WA 98103, USA
Background: Within research each experiment is different, the focus changes and the data is generated from a continually evolving barrage of technologies. There is a continual introduction of new techniques whose usage ranges from in-house protocols through to high-throughput instrumentation. To support these requirements data management systems are needed that can be rapidly built and readily adapted for new usage.

Results: The adaptable data management system discussed is designed to support the seamless mining and analysis of biological experiment data that is commonly used in systems biology (e.g. ChIP-chip, gene expression, proteomics, imaging, flow cytometry). We use different content graphs to represent different views upon the data. These views are designed for different roles: equipment specific views are used to gather instrumentation information; data processing oriented views are provided to enable the rapid development of analysis applications; and research project specific views are used to organize information for individual research experiments. This management system allows for both the rapid introduction of new types of information and the evolution of the knowledge it represents.

Conclusion: Data management is an important aspect of any research enterprise. It is the foundation on which most applications are built, and must be easily extended to serve new functionality for new scientific areas. We have found that adopting a three-tier architecture for data management, built around distributed standardized content repositories, allows us to rapidly develop new applications to support a diverse user community.
Background
To enable the adaptive behaviour that is required when
developing software for research, an "informal" data
management strategy is often needed. By informal we mean
that there is a need to rapidly develop and adapt software
infrastructures to unforeseen and (typically) unspecified
requirements. We have found that the use of a distributed
data management system (consisting of remote
interlinked content repositories) gives us the required
flexibility, while still allowing for the development of the level of
formalization that is required for robust software
development.
Advances in computer science have pushed what can be
achieved with data management systems, and conversely
these advancements have driven the increase in demands
for richer functionality. The computer science research
advancements have involved both hardware and software,
with faster processor speeds enabling other innovations to
become feasible. The way in which data management
systems are built, and extended, has also changed. These
changes in software engineering and design include: the
methodology through which software is constructed (e.g.
components leading to frameworks, and frameworks
leading to aspects [1]); the technology used to allow for
distributed computing (e.g. object brokers evolving
pass-by-value mechanisms, and these being replaced by stateless
Web Services); and the ideology that is used to define the
process through which software is built (e.g. the "rational"
processes being replaced by agile programming). These
advances are continuing to occur, and will have an effect
on the next generation of data management and
distribution tools (e.g. cloud computing becoming mainstream
through the use of Google App Engine or similar).
A number of companies, and academic institutions, have
marketed integration and data management solutions for
the life sciences. These enterprise data integration (and
distributed process) management systems have evolved
over the last 10 years. This evolution has been from single
database based solutions to open, distributed,
interoperable data management solutions (see Figure 1). This
change has been driven by demands for rapid
development, high levels of interoperability and increases in data
volume and complexity. There has been a natural
progression with these integration systems, as they generally
follow the traditional approaches to software designs and
technologies that are prevalent at the time. There are
numerous examples of the application of technical
innovations being the focus of a specific integration product,
for example: in 1996 SRS [2] (from Lion Bioscience)
advocated external indexing to link between numerous gene
and protein data sources; in 1997 the Discovery Center
(from Netgenics) used CORBA [3] based distributed
components to provide bespoke integration products; in 1998
the Alliance framework (from Synomics) promoted an
n-tier application server distributed system, which used
linked domain specific modules; in 1999 the MetaLayer
(from Tripos) utilized XML message passing; in 2000
DiscoveryLink [4] (from IBM) provided a federated database
solution which linked across different databases and flat
files; in 2001 the Genomics Knowledge Platform (from
Incyte) marketed an object integration solution based
solely on EJBs; in 2002 the I3C (a consortium led by Sun
and Oracle) specified the use of an identity driven
approach to integration; in 2003 the LSP (from Oracle)
advocated the use of Web Services; in 2004 IPA (from
Ingenuity) and MetaCore (from GeneGO) used a
knowledge base to provide a solution for the mining of networks
of integrated data; in 2005 caBIG [5] (from NCI) adopted
an MDA (model driven architecture) approach, built using
a J2EE and Web Service based solution, to standardize
their community integration efforts; in 2006 CancerGRID
(from MRC) delivered a resource framework based Web
Service system to bridge between diverse data sources; and
in 2007 caGRID [6] (from NCI) provided a stateful Web
Service and registry system for loosely coupled data and
analysis services.
Figure 1. Evolution of enterprise architectures has occurred within the life sciences. Limitations in the flexibility of data
repository based solutions helped shape the development of integration frameworks. Integration frameworks suffered from
complexity and interoperability problems, and so document based solutions are now becoming the norm.
One common characteristic of these "technology first"
efforts is that they developed solutions that were designed
to work with static "finished" data, not research
information. This style of system works well within publishing
scenarios, where information is to be made available
throughout an enterprise as a static resource. When
actually working within a research institution (or life science
company), where new technologies and ideas are
continually being developed, such static publishing approaches
are not appropriate. Instead, a flexible analysis and access
system is required that allows for the rapid introduction
and integration of man (...truncated)