Prediction of protein secondary structures with a novel kernel density estimation based classifier
BMC Research Notes
Darby Tien-Hao Chang 2
Yu-Yen Ou 0 1
Hao-Geng Hung 6
Meng-Han Yang 6
Chien-Yu Chen 5
Yen-Jen Oyang 3 4 6 7
0 Department of Computer Science and Engineering, Yuan Ze University , Chung-Li, 320, Taiwan, ROC
1 Graduate School of Biotechnology and Bioinformatics, Yuan Ze University , Chung-Li, 320, Taiwan, ROC
2 Department of Electrical Engineering, National Cheng Kung University , Tainan, 70101, Taiwan, ROC
3 Graduate Institute of Networking and Multimedia, National Taiwan University , Taipei, 106, Taiwan, ROC
4 Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University , Taipei, 106, Taiwan, ROC
5 Department of Bio-Industrial Mechatronics Engineering, National Taiwan University , Taipei, 106, Taiwan, ROC
6 Department of Computer Science and Information Engineering, National Taiwan University , Taipei, 106, Taiwan, ROC
7 Center for Systems Biology and Bioinformatics, National Taiwan University , Taipei, 106, Taiwan, ROC
Background: Though prediction of protein secondary structures has been an active research topic in bioinformatics for quite a few years and many approaches have been proposed, a new challenge emerges as contemporary protein structure databases continue to grow rapidly. The challenge is how to effectively exploit all the information implicitly deposited in the protein structure databases and deliver ever-improving prediction accuracy as the databases expand.
Findings: This article addresses the challenge by proposing a predictor designed with a novel kernel density estimation algorithm. One main distinctive feature of the kernel density estimation based approach is that the average execution time of the training process is on the order of O(n log n), where n is the number of instances in the training dataset. In the experiments reported in this article, the proposed predictor delivered an average Q3 (three-state prediction accuracy) score of 80.3% and an average SOV (segment overlap) score of 76.9% for a set of 27 benchmark protein chains extracted from the EVA server that are longer than 100 residues.
Conclusion: The experimental results reported in this article reveal that we can continue to achieve higher prediction accuracy of protein secondary structures by effectively exploiting the structural information deposited in fast-growing protein structure databases. In this respect, the kernel density estimation based approach enjoys a distinctive advantage in the low time complexity of its training process.
Findings
Motivation
In structural biology, protein secondary structures act as
the building blocks for the protein tertiary structures [1,2].
Therefore, analysis of protein secondary structures is an
essential intermediate step toward obtaining a
comprehensive picture of the tertiary structure of a protein. In this
respect, one of the main challenges is how to accurately
identify the polypeptide segments that could fold to form
a secondary structure. This problem is normally referred
to as prediction of protein secondary structures [1,3].
Though prediction of protein secondary structures has
been an active issue in bioinformatics research for quite a
few years and many approaches have been proposed [1,4-10],
a new challenge emerges as the sizes of contemporary
protein structure databases continue to grow rapidly. The
new challenge, which has been addressed in several
recently completed studies [9-11], is concerned with how
we can effectively exploit the information implicitly
deposited in the ever-growing protein structure databases
and deliver ever-improving prediction accuracy. In this
respect, this article proposes the Prote2S predictor
designed with a novel kernel density estimation algorithm [12],
which features an average time complexity of
O(n log n) for carrying out the training process, where n is
the number of instances in the training dataset.
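To make the idea of a kernel density estimation based classifier concrete, the following is a minimal sketch under stated assumptions, not the algorithm of [12]. It uses a fixed-bandwidth Gaussian kernel and a brute-force sum over all training instances, so it illustrates only the decision rule (pick the class with the largest prior-weighted density estimate) and does not reproduce the O(n log n) training procedure. The class name KDEClassifier, the parameter sigma, the two-dimensional feature vectors, and the toy data are all hypothetical; the labels H, E, and C stand for the usual helix, strand, and coil states.

```python
import math
from collections import defaultdict


def gaussian_kernel(x, center, sigma):
    """Unnormalized Gaussian kernel value between x and a training instance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


class KDEClassifier:
    """Illustrative fixed-bandwidth KDE classifier (an assumption, not RVKDE)."""

    def __init__(self, sigma=1.0):
        self.sigma = sigma                  # shared kernel bandwidth (hypothetical)
        self.by_class = defaultdict(list)   # label -> list of training vectors

    def fit(self, X, y):
        # "Training" here only stores the instances grouped by class.
        for features, label in zip(X, y):
            self.by_class[label].append(tuple(features))
        return self

    def scores(self, x):
        # Score = class prior * mean kernel value over that class's instances.
        # The Gaussian normalizing constant is omitted because it is identical
        # for every class when sigma and the dimension are shared.
        total = sum(len(samples) for samples in self.by_class.values())
        result = {}
        for label, samples in self.by_class.items():
            density = sum(gaussian_kernel(x, s, self.sigma) for s in samples)
            density /= len(samples)
            result[label] = (len(samples) / total) * density
        return result

    def predict(self, x):
        class_scores = self.scores(x)
        return max(class_scores, key=class_scores.get)


# Toy data: three well-separated clusters, one per secondary-structure state.
X_train = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9), (0.0, 3.0), (0.1, 3.2)]
y_train = ["H", "H", "E", "E", "C", "C"]
clf = KDEClassifier(sigma=0.5).fit(X_train, y_train)

for query in [(0.1, 0.0), (2.9, 3.0), (0.0, 3.1)]:
    print(query, clf.predict(query))
```

In this sketch every prediction costs time linear in the number of stored instances, which is the cost that a more refined estimator would aim to reduce as the training set grows.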
Experimental results
This section reports the experiments conducted to
investigate how Prote2S performs in comparison with the other
existing predictors of protein secondary structures. The
design of Prote2S is based on the relaxed variable kernel
density estimator (RVKDE) that we have recently
proposed [12]. In the next section, we will discuss how the
RVKDE has been incorporated in the design of Prote2S
and the related issues.
For Prote2S, the training dataset was derived from the
PDB version available at the end of May 2007. To
guarantee that no protein chain used to generate the
training dataset was homologous to the benchmark protein
chains on the EVA server [13], from which the testing
dataset was extracted, BLAST [14] was invoked with the
homology criterion set to a sequence identity higher
than 25%. Then, the CD-HIT clustering algorithm [15]
with the similarity thre (...truncated)