Interfaces to PeptideAtlas: a case study of standard data access systems
Briefings in Bioinformatics. Vol 13. No 5. 615-626
Advance Access published on 22 November 2011
doi:10.1093/bib/bbr067
Sarah Killcoyne, Jeremy Handcock, Thomas Robinson, Eric W. Deutsch and John Boyle
Submitted: 1st September 2011; Received (in revised form): 3rd October 2011
Abstract
Keywords: BioMart; Google Data Sources; caBIG; data access; proteomics
INTRODUCTION
Public repositories of research data are an important
resource to the scientific community in the development of new experiments, analysis, annotation and
validation of data [1,2]. These repositories are also
growing at an ever-increasing rate as new instruments
and techniques are continually developed. Access to
these datasets is offered through a variety of interfaces,
from FTP downloads to web-based database queries.
The basic goals for such interfaces are similar: provide
access to data for tools, analysis or sharing.
The PeptideAtlas is a large-scale public
repository of observed and validated mass spectrometry-derived peptide spectra with associated annotations [3]. It was originally designed as a public
resource to contribute observational data to genome
annotation and assist experimental design by defining
the mass spectrometry-based observable proteome
[4]. Satisfying this diverse usage required the provision
of an openly accessible resource to serve out high-quality data. Raw spectral data with protein/peptide
identifications from multiple species (e.g. human,
Corresponding author. John Boyle, Institute for Systems Biology, 401 Terry Ave N, Seattle WA 98109. Tel.: (206) 732–1200;
E-mail:
Sarah Killcoyne works with software developers, computational biologists and experimentalists to develop and provide software
appropriate to research needs at the ISB.
Jeremy Handcock is an experienced software engineer and has held positions at ISB, Medio Systems, and Amazon.
Thomas Robinson is a software developer currently working on the organization and mining of large-scale data; previously he
designed systems to detect and remediate "malware" at Sunbelt Software.
Eric W. Deutsch is a senior research scientist at the ISB, who leads the development of the Trans-Proteomic Pipeline tool suite and
PeptideAtlas Project.
John Boyle is the Director of Informatics at the ISB. His research interests are in large scale systems biology. The Institute for Systems
Biology was founded in 2000 to address the challenges of understanding biological diversity, particularly with regard to issues of
medicine, global health and the environment.
© The Author 2011. Published by Oxford University Press. For Permissions, please email:
Access to public data sets is important to the scientific community as a resource to develop new experiments or
validate new data. Projects such as the PeptideAtlas, Ensembl and The Cancer Genome Atlas (TCGA) offer both
access to public data and a repository to share their own data. Access to these data sets is often provided through
a web page form and a web service API. Access technologies based on web protocols (e.g. HTTP) have been in use
for over a decade and are widely adopted across the industry for a variety of functions (e.g. search, commercial
transactions, and social media). Each architecture adapts these technologies to provide users with tools to access
and share data. Both commonly used web service technologies (e.g. REST and SOAP), and custom-built solutions
over HTTP are utilized in providing access to research data. Providing multiple access points ensures that the community can access the data in the simplest and most effective manner for their particular needs. This article examines three common access mechanisms for web accessible data: BioMart, caBIG, and Google Data Sources. These
are illustrated by implementing each over the PeptideAtlas repository and reviewed for their suitability based on
specific usages common to research. BioMart, Google Data Sources, and caBIG are each suitable for certain uses.
The tradeoffs made in the development of the technology are dependent on the uses each was designed for (e.g. security versus speed). This means that an understanding of specific requirements and tradeoffs is necessary before selecting the access technology.
METHODS
This section first discusses the present and future
usage of the PeptideAtlas data in proteomics workflows and gives a short overview of the
architecture of the access mechanisms as implemented over the atlas. The second section provides the criteria used to evaluate caBIG, Google Data Sources (GDS) and
BioMart, with a description of the strengths and
weaknesses of each.
Usage of PeptideAtlas
PeptideAtlas is at present used to map experimentally
derived peptide sequences to proteins via the reference database (per organism) and the corresponding
genes. Researchers use it both as a resource in the
process of searching for experimentally observed
peptides and to share their own data with others
across the community. Its continual growth in
observed data since it began in 2003 (Figure 1) illustrates the importance of providing data access to various workflows, including genome annotation [16],
model organism proteomic analysis [17] and human
disease biomarker identification [18].
Figure 1: PeptideAtlas began in 2003, and, over the
years, data from 230 human LC-MS/MS experiments,
comprising a total of about 55 million spectra, have
been added. About 8% of those spectra could be assigned highly confident peptide identifications. At present, the human PeptideAtlas contains about 4.5
million identified spectra corresponding to about
60 000 distinct identified peptides. These peptides map
to 7553 highly non-redundant protein identifiers, covering about one-third of the protein-coding genes in the human
genome.
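The approximate figures quoted in the Figure 1 caption can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are ours, and the implied gene count assumes roughly one protein identifier per gene, which the caption does not state.

```python
# Figures quoted in the Figure 1 caption (all described there as approximate).
total_spectra = 55_000_000
confident_fraction = 0.08          # "About 8%" of spectra confidently identified
identified_spectra_reported = 4_500_000
distinct_peptides = 60_000
protein_identifiers = 7553
gene_coverage = 1 / 3              # "about one-third" of protein-coding genes

# 8% of the total should be close to the reported number of identified spectra.
estimated_identified = total_spectra * confident_fraction

# Assumption: about one protein identifier per gene, so the total gene count
# implied by the caption is identifiers divided by coverage.
implied_gene_count = protein_identifiers / gene_coverage

# Average number of identified spectra supporting each distinct peptide.
spectra_per_peptide = identified_spectra_reported / distinct_peptides

print(f"Estimated identified spectra: {estimated_identified:,.0f}")
print(f"Implied protein-coding genes: {implied_gene_count:,.0f}")
print(f"Identified spectra per peptide: {spectra_per_peptide:.0f}")
```

The estimate of about 4.4 million identified spectra agrees with the reported "about 4.5 million" once the rounding of the 8% figure is taken into account.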
In addition, it is an important resource for targeted
proteomic workflows [19]. As targeted proteomics
offers the opportunity to gather data with greater
accuracy and sensitivity [20], the PeptideAtlas will
become an even more important resource for the
generation of transition lists. Projects using SRM
(selected reaction monitoring) to catalogue representative peptides for all detectable proteins are currently under way for multiple species. These atlases
will offer the most comprehensive coverage of
known proteins, making it possible for researchers
to target proteins for quantification and biomarker
discovery. Generating these data for the atlas requires
the development of new tools to assist in laboratory
automation (e.g. a laboratory information management system or LIMS) and high-throughput methods for QA verification. These tools require
programmatic access mechanisms for use in automated pipelines as well as manual searching using
visual interfaces.
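As an illustration of what programmatic access from an automated pipeline might look like, the sketch below builds a request URL in the style of the Google Visualization data source protocol, where the `tq` parameter carries the query text and `tqx` carries output options. The endpoint URL, the column names (`peptide_sequence`, `n_observations`, `protein_accession`) and the function name are hypothetical; they are not taken from the PeptideAtlas implementation described in this article.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint for illustration; not a real PeptideAtlas URL.
BASE_URL = "https://example.org/peptideatlas/datasource"


def build_query_url(protein_accession, min_observations=1, base_url=BASE_URL):
    """Return a GET URL carrying a data source query for one protein.

    The column names in the query are assumptions made for this sketch.
    """
    # Reject anything other than a plain alphanumeric accession so that the
    # value cannot break out of the quoted string in the query text.
    if not protein_accession.isalnum():
        raise ValueError("accession must be alphanumeric")
    query = (
        "select peptide_sequence, n_observations "
        f"where protein_accession = '{protein_accession}' "
        f"and n_observations >= {int(min_observations)}"
    )
    # 'out:csv' asks the data source for CSV rather than the default JSON.
    return base_url + "?" + urlencode({"tq": query, "tqx": "out:csv"})


url = build_query_url("P02768", min_observations=5)
print(url)

# Round-trip check: the query survives URL encoding unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded["tq"][0])
```

A pipeline step could pass such a URL to any HTTP client and parse the CSV response, while a visual interface could issue the same query from a browser; keeping the query in a single URL parameter is what lets both kinds of client share one access point.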
Access to PeptideAtlas
Due to the diversity of applications of large spectral
repositories, both in terms of usage and users, a
mouse, Caenorhabditis elega (...truncated)