Adaptable data management for systems biology investigations

https://bmcbioinformatics.biomedcentral.com/counter/pdf/10.1186/1471-2105-10-79

John Boyle, Hector Rovira, Chris Cavnor, David Burdick, Sarah Killcoyne, Ilya Shmulevich

Address: Institute for Systems Biology, 1441 N 34th Street, Seattle, WA 98103, USA

Background: Within research each experiment is different, the focus changes, and the data is generated from a continually evolving barrage of technologies. There is a continual introduction of new techniques whose usage ranges from in-house protocols through to high-throughput instrumentation. To support these requirements, data management systems are needed that can be rapidly built and readily adapted for new usage.

Results: The adaptable data management system discussed is designed to support the seamless mining and analysis of biological experiment data that is commonly used in systems biology (e.g. ChIP-chip, gene expression, proteomics, imaging, flow cytometry). We use different content graphs to represent different views upon the data. These views are designed for different roles: equipment specific views are used to gather instrumentation information; data processing oriented views are provided to enable the rapid development of analysis applications; and research project specific views are used to organize information for individual research experiments. This management system allows for both the rapid introduction of new types of information and the evolution of the knowledge it represents.

Conclusion: Data management is an important aspect of any research enterprise. It is the foundation on which most applications are built, and must be easily extended to serve new functionality for new scientific areas. We have found that adopting a three-tier architecture for data management, built around distributed standardized content repositories, allows us to rapidly develop new applications to support a diverse user community.

Background

To enable the adaptive behaviour that is required when developing software for research, an "informal" data management strategy is often needed.
By informal we mean there is a need to rapidly develop and adapt software infrastructures to unforeseen and (typically) unspecified requirements. We have found that the use of a distributed data management system (consisting of remote interlinked content repositories) gives us the required flexibility, while still allowing for the development of the level of formalization that is required for robust software development.

Advances in computer science have pushed what can be achieved with data management systems, and conversely these advancements have driven the increase in demand for richer functionality. The computer science research advancements have involved both hardware and software, with faster processor speeds enabling other innovations to become feasible. The way in which data management systems are built, and extended, has also changed. These changes in software engineering and design include: the methodology through which software is constructed (e.g. components leading to frameworks, and frameworks leading to aspects [1]); the technology used to allow for distributed computing (e.g. object brokers evolving pass-by-value mechanisms, and these being replaced by stateless Web Services); and the ideology that is used to define the process through which software is built (e.g. the "rational" processes being replaced by agile programming). These advances are continuing to occur, and will have an effect on the next generation of data management and distribution tools (e.g. cloud computing becoming mainstream through the use of Google App Engine or similar).

A number of companies, and academic institutions, have marketed integration and data management solutions for the life sciences. These enterprise data integration (and distributed process) management systems have evolved over the last 10 years. This evolution has been from single database based solutions to open, distributed, interoperable data management solutions (see Figure 1).
This change has been driven by demands for rapid development, high levels of interoperability and increases in data volume and complexity. There has been a natural progression with these integration systems, as they generally follow the traditional approaches to software designs and technologies that are prevalent at the time. There are numerous examples of the application of technical innovations being the focus of a specific integration product, for example: in 1996 SRS [2] (from Lion Bioscience) advocated external indexing to link between numerous gene and protein data sources; in 1997 the Discovery Center (from Netgenics) used CORBA [3] based distributed components to provide bespoke integration products; in 1998 the Alliance framework (from Synomics) promoted an n-tier application server distributed system, which used linked domain specific modules; in 1999 the MetaLayer (from Tripos) utilized XML message passing; in 2000 DiscoveryLink [4] (from IBM) provided a federated database solution which linked across different databases and flat files; in 2001 the Genomics Knowledge Platform (from Incyte) marketed an object integration solution based solely on EJBs; in 2002 the I3C (a consortium led by Sun and Oracle) specified the use of an identity driven approach to integration; in 2003 the LSP (from Oracle) advocated the use of Web Services; in 2004 IPA (from Ingenuity) and MetaCore (from GeneGO) used a knowledge base to provide a solution for the mining of networks of integrated data; in 2005 caBIG [5] (from NCI) adopted an MDA (model driven architecture) approach, built using a J2EE and Web Service based solution, to standardize their community integration efforts; in 2006 CancerGRID (from MRC) delivered a resource framework based Web Service system to bridge between diverse data sources; and in 2007 caGRID [6] (from NCI) provided a stateful Web Service and registry system for loosely coupled data and analysis services.

Figure 1: Evolution of enterprise architectures has occurred within the life sciences. Limitations in the flexibility of data repository based solutions helped shape the development of integration frameworks. Integration frameworks suffered from complexity and interoperability problems, and so document based solutions are now becoming the norm.

One common characteristic of these "technology first" efforts is that they developed solutions that were designed to work with static "finished" data, not research information. This style of system works well within publishing scenarios, where information is to be made available throughout an enterprise as a static resource. When actually working within a research institution (or life science company), where new technologies and ideas are continually being developed, such static publishing approaches are not appropriate. Instead, a flexible analysis and access system is required that allows for the rapid introduction and integration of man (...truncated)