Large Scale Identification and Categorization of Protein Sequences Using Structured Logistic Regression


et al. (2014) Large Scale Identification and Categorization of Protein Sequences Using Structured Logistic Regression. PLoS ONE 9(1): e85139. doi:10.1371/journal.pone.0085139

Bjørn P. Pedersen, Georgiana Ifrim, Poul Liboriussen, Kristian B. Axelsen, Michael G. Palmgren, Poul Nissen, Carsten Wiuf, Christian N. S. Pedersen

Editor: Dinesh Gupta, International Centre for Genetic Engineering and Biotechnology (ICGEB), India

1 Centre for Membrane Pumps in Cells and Disease - PUMPKIN, Danish National Research Foundation, Aarhus C, Denmark
2 Department of Molecular Biology, Aarhus University, Aarhus C, Denmark
3 Bioinformatics Research Centre, Aarhus University, Aarhus C, Denmark
4 Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, Geneva, Switzerland
5 Department of Plant and Environmental Sciences, University of Copenhagen, Frederiksberg C, Denmark
6 Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
7 INSIGHT Centre for Data Analytics, University College Dublin, Dublin, Ireland

Background: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. The current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences, and SLR seems well suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large-scale bioinformatics problem.

Results: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 predefined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. We analysed the 9.3 million sequences in the UniProtKB, which represent the bulk of currently known sequences, and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps across organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases.

Conclusions: Using the SLR-based classification tool, we are able to run a large-scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem, and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.

Funding: BPP was supported by a post-doctoral fellowship from the Carlsberg Foundation and by the Danish Cancer Society. CW was supported by the Danish Cancer Society. GI was supported by Science Foundation Ireland INSIGHT Centre for Data Analytics, Science Foundation Ireland grant 10/IN.1/I3032 and by the Danish Cancer Society. PN was supported by a Hallas-Møller stipend from the Novo Nordisk Foundation and by the BIOMEMOS advanced research program of the European Research Council. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

These authors contributed equally to this work.

Systematic sequencing efforts in the last decade have provided complete sequences of an increasing number of genomes, and a large amount of sequence information is available from other organisms.
A traditional analysis based on a multiple sequence alignment (MSA) and tree reconstruction might be computationally feasible for up to ~100k sequences using fast MSA heuristics such as MAFFT and efficient implementations of the canonical neighbour-joining (NJ) method such as QuickTree [30] or RapidNJ [47], or heuristics such as ClearCut [46]. For larger-scale sequence classification, machine learning based methods such as (profile) hidden Markov models (HMM) and Support Vector Machines (SVM) are applicable. These machine learning methods are trained on a subset of the data and then used to rapidly classify unknown sequences. A possible alternative to HMM and SVM is Structured Logistic Regression (SLR) [23]. SLR is a recently developed machine learning method that has not previously been applied to large-scale classification problems in bioinformatics, but has shown great promise in other types of classification [23].

In this paper we provide a proof-of-concept application of SLR to a large-scale classification problem in bioinformatics. We use classification of P-type ATPases as our application because we believe it can generate important biological information. In addition, the rapidly increasing number of possible P-type ATPases calls for an automated procedure that quickly assigns them to classes in order to guide biochemical experiments. Since SLR has previously been shown to compare favourably with SVM [23], we have chosen to compare the performance of our SLR-based classifier to a profile HMM based classifier and, for a smaller set of sequences, to a traditional MSA-based analysis. We have applied SLR-classifiers to the entire UniProtKB v. 15.8 [24] to identify new P-type ATPases and further classify them into the 11 known subfamilies. To examine the per-species distribution of ATPases, we have analyzed 1,123 genomes.
Furthermore, an analysis of the predicted membrane topology of the P-type ATPases found in these genomes shows that the transmembrane region can be described as a three-component system containing a core region of 6 transmembrane helices and two elements that reside in the N- and C-terminal parts.

Description of Structured Logistic Regression

Structured Logistic Regression (SLR) is a machine learning tool first proposed in the context of text categorization [23]. SLR takes as input a training set of $n$ samples, $\{x_i, y_i\}$, $i = 1, \ldots, n$, where $x_i$ is a sequence and $y_i \in \{+1, -1\}$ is a label indicating its class. The SLR output is a set of discriminating subsequences of unrestricted length (also known as k-mers or n-grams, with k or n unrestricted in this case; in this work we refer to them simply as predictors) together with their weights $w_j$, which indicate their discriminative power. The SLR decision function is linear:

\[ f(x_i) = \sum_{j=1}^{k} w_j \, I(\mathrm{predictor}_j \in x_i) \]

where $k$ is the total number of selected predictors and $I(\cdot)$ is the indicator function. To predict class membership of $x_i$, the score $f(x_i)$ is related to the probability that $x_i$ belongs to class $+1$:

\[ p(y_i = +1 \mid x_i, w) = \frac{e^{f(x_i)}}{1 + e^{f(x_i)}} \]

The learning algorithm is based on a coordinate-wise gradient ascent optimization technique for iteratively maximizing the likelihood of the training set [23]. Upon optimizing the likelihood, the algorithm outputs a compact set of discriminative predictors to be used for classification. The o (...truncated)
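As a rough illustration of the decision function described above, the following Python sketch scores a sequence against a set of weighted subsequence predictors and converts the score to a class probability. It is not the authors' implementation and covers only prediction, not training. The function names, the example predictors and their weights are all hypothetical, and the logistic link used to turn the score into a probability is assumed to be the standard one for logistic regression.

```python
import math


def slr_score(sequence, weights):
    """Linear score: sum of w_j over every predictor_j that occurs in the sequence.

    `weights` maps a subsequence (predictor) to its weight. A predictor
    contributes once if it is present, regardless of how often it occurs,
    which mirrors the indicator function I(predictor_j in x_i).
    """
    return sum(w for predictor, w in weights.items() if predictor in sequence)


def slr_probability(sequence, weights):
    """Probability of class +1, assuming the standard logistic link.

    1 / (1 + exp(-f)) is algebraically the same as exp(f) / (1 + exp(f)).
    """
    f = slr_score(sequence, weights)
    return 1.0 / (1.0 + math.exp(-f))


# Hypothetical predictors with made-up weights, for illustration only.
example_weights = {"DKTGT": 2.0, "TGES": 1.0, "AAAA": -1.5}

# Toy sequence containing "DKTGT" and "AAAA" but not "TGES": f = 2.0 - 1.5 = 0.5
example_sequence = "MKDKTGTLTAAAA"
example_score = slr_score(example_sequence, example_weights)
example_probability = slr_probability(example_sequence, example_weights)
```

A sequence would then be assigned to class +1 when its probability exceeds a chosen threshold (0.5 corresponds to a score of zero). Because predictors have unrestricted length, a real model may contain both short and long subsequences in the same weight table.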