Efficient Data Mining Algorithms for Screening Potential Proteins of Drug Target
Hindawi
Mathematical Problems in Engineering
Volume 2017, Article ID 9852063, 10 pages
https://doi.org/10.1155/2017/9852063
Research Article
Qi Wang, JinCai Huang, YangHe Feng, and JiaWei Fei
Science and Technology on Information Systems Engineering Laboratory, College of Information System and Management,
National University of Defense Technology, Changsha, Hunan, China
Correspondence should be addressed to YangHe Feng;
Received 9 December 2016; Revised 22 January 2017; Accepted 16 February 2017; Published 2 March 2017
Academic Editor: Stefan Balint
Copyright © 2017 Qi Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Recent decades have witnessed both a boom in pharmacology and a growing dilemma in drug development. Screening potential
human drug target proteins from open-access databases with well-measured physical and chemical properties plays a crucial
role in drug design and is a challenging but significant task. In this paper, we study the screening of potential drug target
proteins (DTPs) from a carefully collected dataset containing 5376 unlabeled proteins and 517 known DTPs. Our objective is to
screen potential DTPs from the 5376 unlabeled proteins. We propose two strategies that assist in constructing a dataset of reliable
nondrug target proteins (NDTPs) and then employ bagging of decision trees for the final prediction. These two-stage algorithms
showed their effectiveness and superior performance on the testing set. Both algorithms maintained high recall ratios for DTPs,
93.5% and 97.4%, respectively. In one round of experiments, the strategy-1-based bagging of decision trees algorithm screened
about 558 possible DTPs, while the second algorithm predicted 1782 potential DTPs. The two strategy-based algorithms also
agreed in their predictions, with approximately 442 potential DTPs in common. These selected DTPs provide reliable choices
for further verification through biomedical experiments.
1. Background
In biotechnology, pharmacology, and medicine development,
the identification of drug targets aims to discover new
candidate molecules that are active in drug-based therapy.
As noted in [1], the drug target
is a broad concept ranging from molecular entities such as
ribonucleic acids (RNAs), genes, and proteins to biological
phenomena such as phenotypes or pathways.
The history of drug development has confirmed that
most failures in drug exploration can be attributed to
the pursuit of inappropriate targets [2, 3]. It is widely acknowledged that identifying potential targets for intervention is the
first and foremost step in a modern drug discovery campaign [1, 4–7],
which has attracted increasing attention from both
academia and industry. Once a molecule is predicted to be a
drug target, the engineering work of drug design and
clinical trials can begin. Since such programs, which involve huge investments from pharmaceutical corporations and governments,
are time-consuming and labor-intensive, the choice of
potential targets for experiments is crucial.
The dataset collected for our experiments presents a special
case in which a limited number of drug target proteins are known
while the labels of the rest are uncertain, which complicates the screening of potential drug target proteins from the unlabeled set.
Our research relies on the prior information that the
ratio of “druggable” genomes in humans is low, approximately
10% [8]. In light of this, we infer that nondrug target proteins
(NDTPs) dominate the unlabeled set.
Detailed information about our dataset is given in Materials
and Methods. Our ultimate objective is to screen several
reliable drug target proteins (DTPs) from the unlabeled set.
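As a rough illustration of the inference above, the following Python sketch works through the arithmetic. It assumes that the approximately 10% ratio cited from [8] can be applied directly to the unlabeled pool, which the text does not state explicitly; the function name `expected_hidden_targets` is hypothetical and used only for this example.

```python
def expected_hidden_targets(n_unlabeled, druggable_rate):
    """Expected number of drug targets hidden in an unlabeled pool.

    Assumption (not from the paper): the druggable rate applies
    uniformly to the unlabeled proteins.
    """
    return n_unlabeled * druggable_rate


n_unlabeled = 5376   # unlabeled proteins in the dataset
n_known = 517        # known DTPs in the dataset
rate = 0.10          # approximate "druggable" ratio cited from [8]

hidden = expected_hidden_targets(n_unlabeled, rate)
labeled_share = n_known / (n_known + n_unlabeled)

print(f"expected hidden DTPs in the unlabeled pool: about {hidden:.0f}")
print(f"expected NDTP share of the unlabeled pool: {1 - rate:.0%}")
print(f"share of all proteins that are known DTPs: {labeled_share:.1%}")
```

Under this assumption, roughly nine out of ten unlabeled proteins would be NDTPs, which is why treating the unlabeled pool as a noisy source of negatives is plausible.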
Previous methodologies for the identification
of drug target proteins (IDTPs) required specific biological
hypotheses, such as side-effect similarity [9] or
chemical structure and genomic sequence information [10].
For a further review, refer to [4]. To overcome the
limited reliability of such hypotheses and to explore a robust
way to address the problem, we have developed a novel
paradigm that combines the biochemical characteristics of proteins
with modern data mining techniques. Figure 1 shows
the process of drug discovery using data mining techniques.
Inspired by a family of algorithms for positive
and unlabeled learning, we transferred the existing knowledge into the domain of bioinformatics. A two-stage paradigm was adopted for the screening task, with the final results
showing the efficiency of our algorithms.
Figure 1: Process of drug target discovery using data mining
techniques (panel labels: microarray data, data mining technique, candidates list, preclinical test, expert decision).
2. Materials and Methods
Figure 2: Information about proteins with properties (5376 unlabeled proteins; 517 known DTPs; properties: 31 continuous, 5 nominal).
A dataset with 517 known DTPs and 5376 uncertain NDTPs was
employed for the screening task. Specifically, some of the
5376 proteins would be recommended as the most likely
DTPs from the dataset of uncertain NDTPs. Further information about the dataset is illustrated in Figure 2, and supporting materials are available at
http://pan.baidu.com/s/1pLDCkcF.
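The two-stage idea (first build a set of reliable NDTPs from the unlabeled proteins, then train a bagged tree ensemble on known DTPs versus those reliable NDTPs) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' method: the stage-one rule (`reliable_negatives`, which keeps the unlabeled rows farthest from the centroid of the known positives) is a hypothetical stand-in for the paper's two strategies, depth-1 trees (stumps) stand in for full decision trees, and the data are synthetic with two continuous features rather than the 36 properties of the real dataset.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reliable_negatives(positives, unlabeled, keep_fraction=0.5):
    # Stage 1 (hypothetical heuristic, NOT the paper's strategy): keep the
    # unlabeled rows that lie farthest from the centroid of the known positives.
    centroid = [sum(col) / len(col) for col in zip(*positives)]
    ranked = sorted(unlabeled, key=lambda row: sq_dist(row, centroid), reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]


def fit_stump(rows, labels):
    # Depth-1 decision tree: exhaustive search for the (feature, threshold,
    # polarity) triple with the highest training accuracy.
    best = None
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            acc = sum((row[f] > t) == bool(y) for row, y in zip(rows, labels)) / len(rows)
            for polarity, score in ((1, acc), (0, 1 - acc)):
                if best is None or score > best[0]:
                    best = (score, f, t, polarity)
    return best[1:]


def predict_stump(stump, row):
    f, t, polarity = stump
    return polarity if row[f] > t else 1 - polarity


def fit_bagging(rows, labels, n_trees=15, seed=0):
    # Stage 2: each stump is trained on a bootstrap resample of the training set.
    rng = random.Random(seed)
    n = len(rows)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return trees


def predict_bagging(trees, row):
    # Majority vote over the ensemble.
    votes = sum(predict_stump(tree, row) for tree in trees)
    return 1 if 2 * votes > len(trees) else 0


def make_rows(rng, n, center):
    # Synthetic proteins: feature 0 separates the classes, feature 1 is noise.
    return [[rng.gauss(center, 0.4), rng.gauss(0.0, 0.4)] for _ in range(n)]


rng = random.Random(42)
known_dtps = make_rows(rng, 40, 4.0)                             # labeled positives
unlabeled = make_rows(rng, 20, 4.0) + make_rows(rng, 180, 0.0)   # about 10% hidden positives

negatives = reliable_negatives(known_dtps, unlabeled)
trees = fit_bagging(known_dtps + negatives, [1] * len(known_dtps) + [0] * len(negatives))

candidates = [row for row in unlabeled if predict_bagging(trees, row) == 1]
recall_known = sum(predict_bagging(trees, row) for row in known_dtps) / len(known_dtps)
print(len(candidates), recall_known)
```

Recall is measured on the known DTPs because they are the only rows with trusted labels; the unlabeled rows that the ensemble flags play the role of the candidate list for further verification. In practice a library implementation of bagged decision trees would replace the hand-written stumps.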
2.1. Data Collection and Preliminary Analysis
2.1.1. Data Collection. Proteins, as one of the main sources of
drug targets, have long been a heated topic for researchers
from various domains. Some of them interact with each other,
forming the basis of signal transduction pathways and transcriptional regulatory networks. The focus of our research,
drug target proteins, are those functional biomolecules
addressed and controlled by active compounds. In this
paper, we collected proteins from the DrugBank database
(Version 3.0), in which 1604 proteins were annotated as drug
targets [11]. Further data cleaning removed
the nonhuman proteins as well as sequences with similarity larger than
20%, using PISCES [12]. Since proteins are compounds of atoms and
molecules, whether a protein can be a candidate
drug target is frequently determined by factors such as water
solubility, hydrogen ion concentration (pH), trait of bases,
and structure. Although interaction relations provide
additional information for the screening, they are not
fully reliable. Other properties of proteins also originate,
in essence, from their basic chemical or physical properties.
The properties selected in our research were therefore
basic chemical or physical properties of proteins. We
followed the extracting process in (...truncated)