Interfaces to PeptideAtlas: a case study of standard data access systems

BRIEFINGS IN BIOINFORMATICS. VOL 13. NO 5. 615-626
Advance Access published on 22 November 2011
doi:10.1093/bib/bbr067

Sarah Killcoyne, Jeremy Handcock, Thomas Robinson, Eric W. Deutsch and John Boyle

Submitted: 1st September 2011; Received (in revised form): 3rd October 2011

Abstract

Access to public data sets is important to the scientific community as a resource to develop new experiments or validate new data. Projects such as the PeptideAtlas, Ensembl and The Cancer Genome Atlas (TCGA) offer both access to public data and a repository in which researchers can share their own data. Access to these data sets is often provided through a web page form and a web service API. Access technologies based on web protocols (e.g. HTTP) have been in use for over a decade and are widely adopted across the industry for a variety of functions (e.g. search, commercial transactions and social media). Each architecture adapts these technologies to provide users with tools to access and share data. Both commonly used web service technologies (e.g. REST and SOAP) and custom-built solutions over HTTP are used to provide access to research data. Providing multiple access points ensures that the community can access the data in the simplest and most effective manner for their particular needs. This article examines three common access mechanisms for web-accessible data: BioMart, caBIG and Google Data Sources. These are illustrated by implementing each over the PeptideAtlas repository and reviewed for their suitability based on specific usages common to research. BioMart, Google Data Sources and caBIG are each suitable for certain uses. The tradeoffs made in the development of each technology depend on the uses it was designed for (e.g. security versus speed). This means that an understanding of specific requirements and tradeoffs is necessary before selecting the access technology.

Keywords: BioMart; Google Data Sources; caBIG; data access; proteomics

Corresponding author. John Boyle, Institute for Systems Biology, 401 Terry Ave N, Seattle WA 98109. Tel.: (206) 732-1200; E-mail:

Sarah Killcoyne works with software developers, computational biologists and experimentalists to develop and provide software appropriate to research needs at the ISB.

Jeremy Handcock is an experienced software engineer and has held positions at ISB, Medio Systems and Amazon.

Thomas Robinson is a software developer currently working on the organization and mining of large-scale data; previously, he designed systems to detect and remediate "malware" at Sunbelt Software.

Eric W. Deutsch is a senior research scientist at the ISB, who leads the development of the Trans-Proteomic Pipeline tool suite and the PeptideAtlas Project.

John Boyle is the Director of Informatics at the ISB. His research interests are in large-scale systems biology.

The Institute for Systems Biology was founded in 2000 to address the challenges of understanding biological diversity, particularly with regard to issues of medicine, global health and the environment.

© The Author 2011. Published by Oxford University Press. For Permissions, please email:

INTRODUCTION

Public repositories of research data are an important resource to the scientific community in the development of new experiments, analysis, annotation and validation of data [1,2]. These repositories are also growing at an ever-increasing rate as new instruments and techniques are continually developed. Access to these datasets is offered through a variety of interfaces, from FTP downloads to web-based database queries. The basic goals for such interfaces are similar: provide access to data for tools, analysis or sharing.

The PeptideAtlas [3] is a large-scale public repository of observed and validated mass spectrometry-derived peptide spectra with associated annotations [3]. It was originally designed as a public resource to contribute observational data to genome annotation and to assist experimental design by defining the mass spectrometry-based observable proteome [4]. Satisfying this diverse usage required the provision of an openly accessible resource to serve out high-quality data. Raw spectra data with protein/peptide identifications from multiple species (e.g. human,

METHODS

This section will first discuss the present and future usage of the PeptideAtlas data in proteomics workflows. It will also provide a short overview of the architecture of the access mechanisms as implemented over the atlas. The second section will provide the criteria used to evaluate caBIG, Google Data Sources (GDS) and BioMart, with a description of the strengths and weaknesses of each.

Usage of PeptideAtlas

PeptideAtlas is at present used to map experimentally derived peptide sequences to proteins via the reference database (per organism) and the corresponding genes. Researchers use it both as a resource in the process of searching for experimentally observed peptides and to share their own data with others across the community. Its continual growth in observed data since it began in 2003 (Figure 1) illustrates the importance of providing data access to various workflows, including genome annotation [16], model organism proteomic analysis [17] and human disease biomarker identification [18]. In addition, it is an important resource for targeted proteomic workflows [19]. As targeted proteomics offers the opportunity to gather data with greater accuracy and sensitivity [20], the PeptideAtlas will become an even more important resource for the generation of transition lists. Projects using SRM (selected reaction monitoring) to catalogue representative peptides for all detectable proteins are currently under way for multiple species. These atlases will offer the most comprehensive coverage of known proteins, making it possible for researchers to target proteins for quantification and biomarker discovery. Generating these data for the atlas requires the development of new tools to assist in laboratory automation (e.g. a laboratory information management system or LIMS) and high-throughput methods for QA verification. These tools require programmatic access mechanisms for use in automated pipelines as well as manual searching using visual interfaces.

Figure 1: PeptideAtlas began in 2003, and, over the years, data from 230 human LC-MS/MS experiments, comprising a total of about 55 million spectra, have been added. About 8% of those spectra could be assigned highly confident peptide identifications. At present, the human PeptideAtlas contains about 4.5 million identified spectra corresponding to about 60 000 distinct identified peptides. These peptides map to 7553 highly non-redundant protein identifiers, covering about one-third of the protein-coding genes in the human genome.

Access to PeptideAtlas

Due to the diversity of applications of large spectral repositories, both in terms of usage and users, a mouse, Caenorhabditis elega (...truncated)