Interfaces to PeptideAtlas: a case study of standard data access systems
Sarah Killcoyne
Jeremy Handcock
Thomas Robinson
Eric W. Deutsch
John Boyle
Access to public data sets is important to the scientific community as a resource to develop new experiments or validate new data. Projects such as the PeptideAtlas, Ensembl and The Cancer Genome Atlas (TCGA) offer both access to public data and a repository in which researchers can share their own data. Access to these data sets is often provided through a web page form and a web service API. Access technologies based on web protocols (e.g. HTTP) have been in use for over a decade and are widely adopted across the industry for a variety of functions (e.g. search, commercial transactions, and social media). Each architecture adapts these technologies to provide users with tools to access and share data. Both commonly used web service technologies (e.g. REST and SOAP) and custom-built solutions over HTTP are used to provide access to research data. Providing multiple access points ensures that the community can access the data in the simplest and most effective manner for their particular needs. This article examines three common access mechanisms for web-accessible data: BioMart, caBIG, and Google Data Sources. Each is illustrated by an implementation over the PeptideAtlas repository and reviewed for its suitability based on specific usages common to research. BioMart, Google Data Sources, and caBIG are each suitable for certain uses. The tradeoffs made in the development of each technology depend on the uses it was designed for (e.g. security versus speed). An understanding of specific requirements and tradeoffs is therefore necessary before selecting an access technology.
INTRODUCTION
Public repositories of research data are an important
resource to the scientific community in the
development of new experiments, analysis, annotation and
validation of data [1,2]. These repositories are also
growing at an ever-increasing rate as new instruments
and techniques are continually developed. Access to
these datasets is offered through a variety of interfaces,
from FTP downloads to web-based database queries.
The basic goals for such interfaces are similar: provide
access to data for tools, analysis or sharing.
The PeptideAtlas is a large-scale public
repository of observed and validated mass
spectrometry-derived peptide spectra with associated
annotations [3]. It was originally designed as a public
resource to contribute observational data to genome
annotation and assist experimental design by defining
the mass spectrometry-based observable proteome
[4]. Satisfying this diverse usage required the provision
of an openly accessible resource to serve out
high-quality data. Raw spectral data with protein/peptide
identifications from multiple species (e.g. human,
mouse, Caenorhabditis elegans, Saccharomyces cerevisiae
and Drosophila melanogaster) identified in mass
spectrometry experiments have been made public
through the PeptideAtlas. These data are available
to the thousands of users of the PeptideAtlas for use
in their own experimental pipelines.
Corresponding author. John Boyle, Institute for Systems Biology, 401 Terry Ave N, Seattle WA 98109. Tel.: (206) 7321200;
E-mail:
Sarah Killcoyne works with software developers, computational biologists and experimentalists to develop and provide software
appropriate to research needs at the ISB.
Jeremy Handcock is an experienced software engineer and has held positions at ISB, Medio Systems, and Amazon.
Thomas Robinson is a software developer currently working on the organization and mining of large-scale data; previously, he
designed systems to detect and remediate malware at Sunbelt Software.
Eric W. Deutsch is a senior research scientist at the ISB, who leads the development of the Trans-Proteomic Pipeline tool suite and
PeptideAtlas Project.
John Boyle is the Director of Informatics at the ISB. His research interests are in large-scale systems biology. The Institute for Systems
Biology was founded in 2000 to address the challenges of understanding biological diversity, particularly with regard to issues of
medicine, global health and the environment.
The importance of multiple access points to public
repositories like PeptideAtlas lies in the large variety
of ad hoc and standard tools for integration and
analysis of scientific data. As an example, researchers in
proteomics use various spectra and sequence search
tools (e.g. SEQUEST [5], X!TANDEM [6] and
Mascot [7]), analysis tools for spectra quality or
peptide/protein identification (e.g. PeptideProphet [8],
PeptideSieve [9] and QualScore [10]) as well as data
mining (e.g. mspecLINE [11]) and visualization (e.g.
Cytoscape [12]) tools. It is important that each of
these very different tools has access to the required
data, tailored to its specific usage.
This article will review three widely used data
access frameworks: BioMart version 0.7 [13],
caBIG version 1.3.0.1 [14], and Google Data
Source API (GDS) version 1.0.2 [15]. These have
all been used to make PeptideAtlas available to
users. The Methods section will provide a
background on the usage and projected growth of
PeptideAtlas and introduce each of the three access
mechanisms being compared. The Results section
will discuss the criteria used for comparison with a
focus on describing the applicability of the
frameworks for potential users.
METHODS
This section will first discuss the present and future
usage of the PeptideAtlas data in proteomics
workflows. It will also provide a short overview of the
architecture of the access mechanisms as
implemented over the atlas. The second section will
provide the criteria used to evaluate caBIG, GDS and
BioMart with a description of the strengths and
weaknesses of each.
Usage of PeptideAtlas
PeptideAtlas is currently used to map experimentally
derived peptide sequences to proteins via the
reference database (per organism) and the corresponding
genes. Researchers use it both as a resource in the
process of searching for experimentally observed
peptides and to share their own data with others
across the community. Its continual growth in
observed data since it began in 2003 (Figure 1)
illustrates the importance of providing data access to
various workflows including: genome annotation [16],
model organism proteomic analysis [17], and human
disease biomarker identification [18].
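The mapping described above, from an observed peptide to proteins in a per-organism reference database and on to the corresponding genes, can be pictured with a minimal sketch. It is only an illustration under stated assumptions: the record layout, the identifiers (`PROT_A`, `geneA`, `toy_organism`), the sequences and the use of plain substring matching are all invented here and are not taken from the PeptideAtlas implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceProtein:
    accession: str  # hypothetical protein identifier
    gene: str       # hypothetical gene symbol
    sequence: str   # toy amino-acid sequence


# Toy per-organism reference database; every entry is invented for illustration.
REFERENCE_DB = {
    "toy_organism": [
        ReferenceProtein("PROT_A", "geneA", "MKTAYIAKQRQISFVK"),
        ReferenceProtein("PROT_B", "geneB", "MSHFSRQISFVKDLLG"),
    ]
}


def map_peptide(peptide, organism, db=REFERENCE_DB):
    """Return (protein accession, gene) pairs whose sequence contains the peptide.

    Assumption: exact substring matching stands in for whatever mapping
    procedure the real repository uses.
    """
    return [
        (protein.accession, protein.gene)
        for protein in db.get(organism, [])
        if peptide in protein.sequence
    ]


if __name__ == "__main__":
    # A peptide shared by both toy proteins maps to two protein/gene pairs.
    print(map_peptide("RQISFVK", "toy_organism"))
    # A peptide unique to one toy protein maps to a single pair.
    print(map_peptide("TAYIAK", "toy_organism"))
```

In this sketch a peptide that occurs in more than one reference protein is reported against each of them, which is one simple way to represent the ambiguity that shared peptides introduce when moving from peptides to proteins and genes.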
In addition, it is an important resource for targeted
proteomic workflows [19]. As targeted proteomics
offers the opportunity to gather data with greater
accuracy and sensitivity [20], the PeptideAtlas will
become an even more important resource for the
generation of transition lists. Projects using SRM
(selected reaction monitoring) to catalogue
representative peptides for all detectable proteins are
currently under way for multiple species. These atlases
will offer the most comprehensive coverage of
known proteins, making it possible for researchers
to target proteins for quantification and biomarker
discovery. G (...truncated)