Prediction of protein secondary structures with a novel kernel density estimation based classifier

BMC Research Notes, Dec 2008

Background: Though prediction of protein secondary structures has been an active research issue in bioinformatics for quite a few years and many approaches have been proposed, a new challenge emerges as the sizes of contemporary protein structure databases continue to grow rapidly. The new challenge concerns how we can effectively exploit all the information implicitly deposited in the protein structure databases and deliver ever-improving prediction accuracy as the databases expand rapidly.

Findings: The new challenge is addressed in this article by proposing a predictor designed with a novel kernel density estimation algorithm. One main distinctive feature of the kernel density estimation based approach is that the average execution time taken by the training process is in the order of O(n log n), where n is the number of instances in the training dataset. In the experiments reported in this article, the proposed predictor delivered an average Q3 (three-state prediction accuracy) score of 80.3% and an average SOV (segment overlap) score of 76.9% for a set of 27 benchmark protein chains extracted from the EVA server that are longer than 100 residues.

Conclusion: The experimental results reported in this article reveal that we can continue to achieve higher prediction accuracy of protein secondary structures by effectively exploiting the structural information deposited in fast-growing protein structure databases. In this respect, the kernel density estimation based approach enjoys a distinctive advantage with its low time complexity for carrying out the training process.
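The Q3 score quoted in the abstract is the percentage of residues whose predicted state (helix, strand, or coil) matches the observed one. A minimal Python sketch of that calculation follows, assuming both assignments are given as equal-length strings over the letters H, E, and C. The function name and the example strings are illustrative and are not taken from the article.

```python
def q3_score(observed: str, predicted: str) -> float:
    """Return the three-state accuracy (Q3) as a percentage.

    Assumption: each residue is labeled H (helix), E (strand), or C (coil),
    and both strings describe the same protein chain.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    valid_states = set("HEC")
    if not set(observed) <= valid_states or not set(predicted) <= valid_states:
        raise ValueError("only the states H, E, and C are allowed")

    # Count positions where the predicted state equals the observed state.
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


# Hypothetical ten-residue example: one strand residue is mispredicted as coil.
example_q3 = q3_score("HHHEEECCCC", "HHHEECCCCC")
```

An average Q3 over a benchmark set, such as the 80.3% reported above, would then be the mean of the per-chain scores. The SOV score measures segment overlap rather than per-residue agreement and is not covered by this sketch.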



Darby Tien-Hao Chang (2), Yu-Yen Ou (0, 1), Hao-Geng Hung (6), Meng-Han Yang (6), Chien-Yu Chen (5), Yen-Jen Oyang (3, 4, 6, 7)

0 Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 320, Taiwan, ROC
1 Graduate School of Biotechnology and Bioinformatics, Yuan Ze University, Chung-Li, 320, Taiwan, ROC
2 Department of Electrical Engineering, National Cheng Kung University, Tainan, 70101, Taiwan, ROC
3 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, 106, Taiwan, ROC
4 Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, 106, Taiwan, ROC
5 Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, 106, Taiwan, ROC
6 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 106, Taiwan, ROC
7 Center for Systems Biology and Bioinformatics, National Taiwan University, Taipei, 106, Taiwan, ROC

Findings

Motivation

In structural biology, protein secondary structures act as the building blocks for the protein tertiary structures [1,2]. Therefore, analysis of protein secondary structures is an essential intermediate step toward obtaining a comprehensive picture of the tertiary structure of a protein. In this respect, one of the main challenges is how to accurately identify the polypeptide segments that could fold to form a secondary structure. This problem is normally referred to as prediction of protein secondary structures [1,3].

Though prediction of protein secondary structures has been an active issue in bioinformatics research for quite a few years and many approaches have been proposed [1,4-10], a new challenge emerges as the sizes of contemporary protein structure databases continue to grow rapidly. The new challenge, which has been addressed in several recently completed studies [9-11], is concerned with how we can effectively exploit the information implicitly deposited in the ever-growing protein structure databases and deliver ever-improving prediction accuracy. In this respect, this article proposes the Prote2S predictor designed with a novel kernel density estimation algorithm [12], which features an average time complexity of O(n log n) for carrying out the training process, where n is the number of instances in the training dataset.

Experimental results

This section reports the experiments conducted to investigate how Prote2S performs in comparison with the other existing predictors of protein secondary structures. The design of Prote2S is based on the relaxed variable kernel density estimator (RVKDE) that we have recently proposed [12]. In the next section, we will discuss how the RVKDE has been incorporated in the design of Prote2S and the related issues.

For Prote2S, the training dataset was derived from the PDB version available at the end of May 2007. To guarantee that no protein chain used to generate the training dataset is homologous to the benchmark protein chains on the EVA server [13], from which the testing dataset was extracted, BLAST [14] was invoked and the criterion of homology was set to sequence identity higher than 25%. Then, the CD-HIT clustering algorithm [15] with the similarity thre (...truncated)
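To illustrate the general idea of classifying with kernel density estimates, the sketch below estimates one density per class and assigns a query point to the class with the largest prior-weighted density. It is a plain fixed-bandwidth Gaussian estimator and is not the RVKDE algorithm of [12], nor does it reproduce the O(n log n) training procedure. The class name, the bandwidth value, and the toy two-dimensional feature vectors are assumptions made purely for illustration.

```python
import math
from collections import defaultdict


def gaussian_kernel(x, center, bandwidth):
    """Isotropic Gaussian kernel centered on one training sample."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, center))
    dimension = len(x)
    normalizer = (2.0 * math.pi * bandwidth ** 2) ** (dimension / 2.0)
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2)) / normalizer


class KDEClassifier:
    """Naive kernel density classifier (illustrative, fixed bandwidth)."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth
        self.samples = defaultdict(list)  # class label -> training points
        self.total = 0

    def fit(self, points, labels):
        # "Training" here only stores the samples grouped by class.
        for point, label in zip(points, labels):
            self.samples[label].append(tuple(point))
            self.total += 1
        return self

    def predict(self, x):
        best_label, best_score = None, -1.0
        for label, class_points in self.samples.items():
            # Class-conditional density: average kernel value over the class.
            density = sum(
                gaussian_kernel(x, p, self.bandwidth) for p in class_points
            ) / len(class_points)
            prior = len(class_points) / self.total
            score = prior * density
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data: two well-separated clusters labeled with hypothetical states.
train_points = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9)]
train_labels = ["H", "H", "E", "E"]
model = KDEClassifier(bandwidth=0.5).fit(train_points, train_labels)
predicted_state = model.predict((0.1, 0.1))
```

In this naive form every prediction visits all stored samples, so the cost grows with the size of the training set; how Prote2S derives its features and limits that cost is described in the full article rather than in this preview.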


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1186%2F1756-0500-1-51.pdf

Darby Tien-Hao Chang, Yu-Yen Ou, Hao-Geng Hung, Meng-Han Yang, Chien-Yu Chen, Yen-Jen Oyang. Prediction of protein secondary structures with a novel kernel density estimation based classifier, BMC Research Notes, 2008, pp. 51, Volume 1, Issue 1, DOI: 10.1186/1756-0500-1-51