Large Scale Identification and Categorization of Protein Sequences Using Structured Logistic Regression
et al. (2014) Large Scale Identification and Categorization of Protein Sequences Using
Structured Logistic Regression. PLoS ONE 9(1): e85139. doi:10.1371/journal.pone.0085139
Bjørn P. Pedersen 0
Georgiana Ifrim 0
Poul Liboriussen 0
Kristian B. Axelsen 0
Michael G. Palmgren 0
Poul Nissen 0
Carsten Wiuf 0
Christian N. S. Pedersen 0
Dinesh Gupta, International Centre for Genetic Engineering and Biotechnology (ICGEB), India
0 1 Centre for Membrane Pumps in Cells and Disease - PUMPKIN, Danish National Research Foundation, Aarhus C, Denmark, 2 Department of Molecular Biology, Aarhus University, Aarhus C, Denmark, 3 Bioinformatics Research Centre, Aarhus University, Aarhus C, Denmark, 4 Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, Geneva, Switzerland, 5 Department of Plant and Environmental Sciences, University of Copenhagen, Frederiksberg C, Denmark, 6 Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark, 7 INSIGHT Centre for Data Analytics, University College Dublin, Dublin, Ireland
Background: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. The current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences, and SLR seems well suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large-scale bioinformatics problem.
Results: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 predefined classes. The SLR classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. We analysed 9.3 million sequences in the UniProtKB, representing the bulk of currently known sequences, and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps across organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases.
Conclusions: Using the SLR-based classification tool, we are able to run a large-scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem, and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.
Funding: BPP was supported by a post-doctoral fellowship from the Carlsberg Foundation and by the Danish Cancer Society. CW was supported by the Danish
Cancer Society. GI was supported by Science Foundation Ireland INSIGHT Centre for Data Analytics, Science Foundation Ireland grant 10/IN.1/I3032 and by the
Danish Cancer Society. PN was supported by a Hallas-Møller stipend from the Novo Nordisk Foundation and by the BIOMEMOS advanced research program of the
European Research Council. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
These authors contributed equally to this work.
Systematic sequencing efforts in the last decade have provided
complete sequences of an increasing number of genomes, and a
large amount of sequence information is available from other
organisms. A traditional analysis based on a multiple sequence
alignment (MSA) and tree reconstruction might be computationally
feasible for up to ~100k sequences using fast MSA heuristics such
as MAFFT and efficient implementations of the canonical
neighbour-joining (NJ) method such as QuickTree [30] or
RapidNJ [47], or heuristics such as ClearCut [46]. For
larger-scale sequence classification, machine learning based methods
such as (profile) hidden Markov models (HMM) and Support
Vector Machines (SVM) are applicable. These machine learning
methods are trained on a subset of the data and then used to
rapidly classify unknown sequences.
A possible alternative to HMM and SVM is Structured Logistic
Regression (SLR) [23]. SLR is a recently developed machine
learning method that has not previously been applied to
large-scale classification problems in bioinformatics, but has shown
great promise in other types of classification [23]. In this paper we
provide a proof-of-concept application of SLR to a large-scale
classification problem in bioinformatics. We use classification of
P-type ATPases as our application because we believe it can
generate important biological information. In addition, the rapidly
increasing number of possible P-type ATPases calls for an
automated procedure to facilitate the quick analysis of their
distribution into different classes to guide biochemical
experiments. Since SLR has previously been shown to compare
favourably with SVM [23], we have chosen to compare the
performance of our SLR-based classifier to a profile HMM-based
classifier and, for a smaller set of sequences, to a traditional
MSA-based analysis.
We have applied SLR-classifiers to the entire UniProtKB v.
15.8 [24] to identify new P-type ATPases and further classify them
into the 11 known subfamilies. To examine the per-species
distribution of ATPases, we have analysed 1,123 genomes.
Furthermore, an analysis of the predicted membrane topology of
P-type ATPases found in these genomes shows that the
transmembrane region can be described as a three-component
system containing a core region of 6 transmembrane helices and
two elements that reside in the N- and C-terminal parts.
Description of Structured Logistic Regression
Structured Logistic Regression (SLR) is a machine learning tool
first proposed in the context of text categorization [23]. SLR takes
as input a training set of n samples, $\{(x_i, y_i)\}$, $i = 1, \ldots, n$, where $x_i$ is a
sequence and $y_i \in \{+1, -1\}$ is a label indicating the class. The
SLR output is a set of discriminating subsequences of unrestricted
length (also known as k-mers or n-grams, with k or n unrestricted in
this case; in this work we refer to them simply as predictors)
together with their weights $w_j$, indicative of their discriminative
power. The SLR decision function is linear:
$$f(x_i) = \sum_{j=1}^{k} w_j \, I(\mathrm{predictor}_j \in x_i)$$
where k is the total number of selected predictors and I(.) is the
indicator function. To predict class membership of $x_i$, the score
$f(x_i)$ is related to the probability that $x_i$ belongs to class +1:
$$p(y_i = +1 \mid x_i, w) = \frac{e^{f(x_i)}}{1 + e^{f(x_i)}}$$
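To make the scoring step concrete, the following is a minimal Python sketch of a linear decision function over subsequence predictors, followed by a mapping of the score to a class probability. It assumes the standard logistic (sigmoid) link. The function names, the example sequence, and the predictor weights are hypothetical and chosen for illustration only; they are not taken from the trained classifiers described in this paper.

```python
import math


def slr_score(sequence, weights):
    """Linear score: sum the weights of all predictors that occur
    as a contiguous subsequence of `sequence`."""
    return sum(w for predictor, w in weights.items() if predictor in sequence)


def slr_probability(sequence, weights):
    """Map the linear score to a probability of class +1,
    assuming a standard logistic link."""
    return 1.0 / (1.0 + math.exp(-slr_score(sequence, weights)))


# Hypothetical predictors and weights (illustrative values, not learned ones).
weights = {"DKTGT": 2.0, "TGES": 1.0, "WWW": -1.5}

# Hypothetical toy sequence containing two of the three predictors.
sequence = "MAADKTGTLTTGESVV"

score = slr_score(sequence, weights)
probability = slr_probability(sequence, weights)
print(f"score = {score:.2f}, p(+1) = {probability:.4f}")
```

Because the indicator function only records presence or absence, a predictor contributes its weight once regardless of how many times it occurs in the sequence. A sequence matching no predictor receives a score of 0 and hence a probability of 0.5 under this sketch.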
The learning algorithm is based on a coordinate-wise gradient
ascent optimization technique for iteratively maximizing the
likelihood of the training set [23]. Upon optimizing the likelihood,
the algorithm outputs a compact set of discriminative predictors to
be used for classification. The o (...truncated)