Interfaces to PeptideAtlas: a case study of standard data access systems

https://bib.oxfordjournals.org/content/13/5/615.full.pdf

Sarah Killcoyne, Jeremy Handcock, Thomas Robinson, Eric W. Deutsch and John Boyle

Access to public data sets is important to the scientific community as a resource to develop new experiments or validate new data. Projects such as the PeptideAtlas, Ensembl and The Cancer Genome Atlas (TCGA) offer both access to public data and a repository in which researchers can share their own data. Access to these data sets is often provided through a web page form and a web service API. Access technologies based on web protocols (e.g. HTTP) have been in use for over a decade and are widely adopted across the industry for a variety of functions (e.g. search, commercial transactions and social media). Each architecture adapts these technologies to provide users with tools to access and share data. Both commonly used web service technologies (e.g. REST and SOAP) and custom-built solutions over HTTP are used to provide access to research data. Providing multiple access points ensures that the community can access the data in the simplest and most effective manner for their particular needs. This article examines three common access mechanisms for web-accessible data: BioMart, caBIG and Google Data Sources. These are illustrated by implementing each over the PeptideAtlas repository and reviewed for their suitability based on specific usages common to research. BioMart, Google Data Sources and caBIG are each suitable for certain uses. The tradeoffs made in the development of each technology depend on the uses it was designed for (e.g. security versus speed). An understanding of specific requirements and tradeoffs is therefore necessary before selecting an access technology.

INTRODUCTION

Public repositories of research data are an important resource to the scientific community in the development of new experiments, analysis, annotation and validation of data [1,2]. These repositories are also growing at an ever-increasing rate as new instruments and techniques are continually developed.
Access to these datasets is offered through a variety of interfaces, from FTP downloads to web-based database queries. The basic goals for such interfaces are similar: provide access to data for tools, analysis or sharing. The PeptideAtlas is a large-scale public repository of observed and validated mass spectrometry-derived peptide spectra with associated annotations [3]. It was originally designed as a public resource to contribute observational data to genome annotation and assist experimental design by defining the mass spectrometry-based observable proteome [4]. Satisfying this diverse usage required the provision of an openly accessible resource to serve out high-quality data. Raw spectra data with protein/peptide identifications from multiple species (e.g. human, mouse, Caenorhabditis elegans, Saccharomyces cerevisiae and Drosophila melanogaster) identified in mass spectrometry experiments have been made public through the PeptideAtlas. These data are available to the thousands of users of the PeptideAtlas for use in their own experimental pipelines.

The importance of multiple access points to public repositories like PeptideAtlas lies in the large variety of ad hoc and standard tools for integration and analysis of scientific data. As an example, researchers in proteomics use various spectra and sequence search tools (e.g. SEQUEST [5], X!TANDEM [6] and Mascot [7]), analysis tools for spectra quality or peptide/protein identification (e.g. PeptideProphet [8], PeptideSieve [9] and QualScore [10]) as well as data mining (e.g. mspecLINE [11]) and visualization (e.g. Cytoscape [12]) tools. It is important that each of these very different tools has access to the required data, tailored to its specific usage.

This article will review three widely used data access frameworks: BioMart version 0.7 [13], caBIG version 1.3.0.1 [14] and Google Data Source API (GDS) version 1.0.2 [15]. These have all been used to make PeptideAtlas available to users. The Methods section will provide background on the usage and projected growth of PeptideAtlas and introduce each of the three access mechanisms being compared. The Results section will discuss the criteria used for comparison, with a focus on describing the applicability of the frameworks for potential users.

Corresponding author. John Boyle, Institute for Systems Biology, 401 Terry Ave N, Seattle WA 98109. Tel.: (206) 7321200; E-mail:

Sarah Killcoyne works with software developers, computational biologists and experimentalists to develop and provide software appropriate to research needs at the ISB.
Jeremy Handcock is an experienced software engineer and has held positions at ISB, Medio Systems and Amazon.
Thomas Robinson is a software developer currently working on the organization and mining of large-scale data; previously he designed systems to detect and remediate malware at Sunbelt Software.
Eric W. Deutsch is a senior research scientist at the ISB, who leads the development of the Trans-Proteomic Pipeline tool suite and PeptideAtlas Project.
John Boyle is the Director of Informatics at the ISB. His research interests are in large-scale systems biology.
The Institute for Systems Biology was founded in 2000 to address the challenges of understanding biological diversity, particularly with regard to issues of medicine, global health and the environment.

METHODS

This section will first discuss the present and future usage of the PeptideAtlas data in proteomics workflows. It will also provide a short overview of the architecture of the access mechanisms as implemented over the atlas. The second section will provide the criteria used to evaluate caBIG, GDS and BioMart with a description of the strengths and weaknesses of each.
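All three frameworks compared here are reached over HTTP, so it may help to see what a request to one of them can look like. The following is a minimal sketch, not the implementation used for PeptideAtlas: it builds a Google Data Source style GET request, in which the query text travels in the `tq` parameter and the requested output format in `tqx`. The endpoint URL and the column names (`peptide_sequence`, `protein_accession`, `organism`) are hypothetical placeholders, since the article does not give them.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint: the article does not give the PeptideAtlas data source URL.
BASE_URL = "https://example.org/peptideatlas/datasource"


def build_gds_url(base_url, select_columns, where=None, limit=None, out="csv"):
    """Build a GET URL that carries a data source query in the `tq` parameter."""
    query = "select " + ", ".join(select_columns)
    if where:
        query += " where " + where
    if limit is not None:
        query += " limit " + str(int(limit))
    # `tqx=out:<format>` asks the data source for a specific response format.
    params = {"tq": query, "tqx": "out:" + out}
    return base_url + "?" + urlencode(params)


url = build_gds_url(
    BASE_URL,
    ["peptide_sequence", "protein_accession"],  # hypothetical column names
    where="organism = 'Human'",                  # hypothetical column name
    limit=10,
)

# Decode the URL again to confirm that the query survives URL encoding.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["tq"][0])
```

Because the request is an ordinary URL, the same query can be issued from a browser, a script or a visualization component, which is one reason lightweight access points of this kind are convenient for ad hoc tools.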
Usage of PeptideAtlas

PeptideAtlas is currently used to map experimentally derived peptide sequences to proteins via the reference database (per organism) and the corresponding genes. Researchers use it both as a resource in the process of searching for experimentally observed peptides and to share their own data with others across the community. Its continual growth in observed data since it began in 2003 (Figure 1) illustrates the importance of providing data access to various workflows, including genome annotation [16], model organism proteomic analysis [17] and human disease biomarker identification [18]. In addition, it is an important resource for targeted proteomic workflows [19]. As targeted proteomics offers the opportunity to gather data with greater accuracy and sensitivity [20], the PeptideAtlas will become an even more important resource for the generation of transition lists. Projects using SRM (selected reaction monitoring) to catalogue representative peptides for all detectable proteins are currently under way for multiple species. These atlases will offer the most comprehensive coverage of known proteins, making it possible for researchers to target proteins for quantification and biomarker discovery. G (...truncated)