Efficient Data Mining Algorithms for Screening Potential Proteins of Drug Target (pdf)

Hindawi
Mathematical Problems in Engineering
Volume 2017, Article ID 9852063, 10 pages
https://doi.org/10.1155/2017/9852063

Research Article

Efficient Data Mining Algorithms for Screening Potential Proteins of Drug Target

Qi Wang, JinCai Huang, YangHe Feng, and JiaWei Fei

Science and Technology on Information Systems Engineering Laboratory, College of Information System and Management, National University of Defense Technology, Changsha, Hunan, China

Correspondence should be addressed to YangHe Feng

Received 9 December 2016; Revised 22 January 2017; Accepted 16 February 2017; Published 2 March 2017

Academic Editor: Stefan Balint

Copyright © 2017 Qi Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The past few decades have witnessed a boom in pharmacology as well as growing difficulties in drug development. Screening potential human drug target proteins from open-access databases with well-measured physical and chemical properties plays a crucial role in drug design and is a task that is both challenging and significant. In this paper, we study the screening of potential drug target proteins (DTPs) from a carefully collected dataset containing 5376 unlabeled proteins and 517 known DTPs. Our objective is to screen potential DTPs from the 5376 unlabeled proteins. We propose two strategies for constructing a dataset of reliable nondrug target proteins (NDTPs) and then employ bagging of decision trees for the final prediction. These two-stage algorithms proved effective and performed well on the testing set. Both algorithms maintained high recall on DTPs, 93.5% and 97.4%, respectively.
In one round of experiments, the strategy 1-based bagging of decision trees algorithm screened about 558 possible DTPs, while the second algorithm predicted 1782 potential DTPs. The two strategy-based algorithms also showed consensus in their predictions, with approximately 442 potential DTPs in common. These selected DTPs provide reliable choices for further verification through biomedical experiments.

1. Background
In biotechnology, pharmacology, and drug development, the identification of drug targets aims to discover new candidate molecules that are active in drug-based treatment. As noted in [1], a drug target is a broad concept ranging from molecular entities such as ribonucleic acids (RNAs), genes, and proteins to biological phenomena such as phenotypes or pathways. The history of drug development confirms that most failures in drug exploration can be attributed to the pursuit of inappropriate targets [2, 3]. It is widely acknowledged that identifying potential targets for intervention is the first and foremost step in modern drug discovery [1, 4–7], and it has attracted increasing attention from both academia and industry. Once a molecule is predicted to be a drug target, the engineering of drug design begins, leading up to clinical trials. Since such programs involve huge investments from pharmaceutical corporations and governments and are time-consuming and labor-intensive, the choice of potential targets for experiments is crucial.

Our dataset presents a special case in which only a limited number of drug target proteins are known while the labels of the rest are uncertain, which complicates the screening of potential drug target proteins from the unlabeled set. One piece of prior information supporting our research is the low proportion of the human genome that is “druggable”, approximately 10% [8].
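As a rough illustration of this prior, the arithmetic can be sketched in a few lines of Python. The counts come from the text; applying the genome-wide figure of about 10% directly to the 5376 unlabeled proteins is an assumption made only for illustration, and the function name expected_split is hypothetical.

```python
def expected_split(n_unlabeled: int, druggable_ratio: float) -> tuple[int, int]:
    """Return (expected DTPs, expected NDTPs) among the unlabeled proteins.

    Assumption: the unlabeled set follows the same druggable ratio as the
    prior. This is an illustrative estimate, not a result from the paper.
    """
    expected_dtps = round(n_unlabeled * druggable_ratio)
    expected_ndtps = n_unlabeled - expected_dtps
    return expected_dtps, expected_ndtps


# 5376 unlabeled proteins and the approximate 10% prior cited as [8].
dtps, ndtps = expected_split(5376, 0.10)
print(f"expected DTPs:  ~{dtps}")
print(f"expected NDTPs: ~{ndtps}")
```

The estimate is only an order-of-magnitude guide, since the unlabeled set need not mirror the genome-wide ratio.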
In light of this, nondrug target proteins (NDTPs) can be inferred to dominate the unlabeled set. For more detailed information about our dataset, see Materials and Methods. Our ultimate objective is to screen several reliable drug target proteins (DTPs) from the unlabeled set.

Previous methodologies for the identification of drug target proteins (IDTPs) required specific biological hypotheses, such as side-effect similarity [9], chemical structure, and genomic sequence information [10]. For a further review, refer to [4]. To overcome the limited reliability of such hypotheses and to explore a robust way to address the problem, we have developed a novel paradigm combining the biochemical characteristics of proteins with rapidly developing data mining techniques. Figure 1 shows the process of drug discovery using data mining techniques. Inspired by the family of positive and unlabeled learning algorithms, we transferred this existing knowledge to the domain of bioinformatics. A two-stage paradigm was adopted for the screening task, with the final result showing the efficiency of our algorithms.

Figure 1: Process of drug target discovery using data mining techniques. (Stages shown: microarray data, data mining technique, candidates list, preclinical test, expert decision.)

Figure 2: Information about proteins with properties. (Labels shown: 5376 unlabeled proteins; 517 known DTPs; properties: 31 continuous, 5 nominal.)

2. Materials and Methods
A dataset with 517 known DTPs and 5376 uncertain NDTPs was employed for the screening task. Specifically, some of the 5376 proteins with uncertain labels are recommended as the most likely DTPs. Further information about the dataset used in the experiments is illustrated in Figure 2, and supporting materials are available at http://pan.baidu.com/s/1pLDCkcF.

2.1. Data Collection and Preliminary Analysis

2.1.1. Data Collection.
Proteins, as one of the main sources of drug targets, have long been a topic of intense interest for researchers from various domains. Some of them interact with each other, forming the basis of signal transduction pathways and transcriptional regulatory networks. As the focus of our research, drug target proteins are functional biomolecules that are addressed and controlled by active compounds. In this paper, we collected proteins from the DrugBank Database (Version 3.0), in which 1604 proteins were annotated as drug targets [11]. The data were further cleaned by removing the nonhuman proteins as well as those with sequence identity greater than 20%, using PISCES [12]. Because proteins are composed of atoms and molecules, whether a protein can be a candidate drug target is frequently determined by factors such as water solubility, hydrogen ion concentration (pH), the traits of its bases, and its structure. Although interaction relations provide additional information for the screening, they are not fully reliable. Other properties of proteins essentially derive from their basic chemical or physical properties, so the properties selected in this research were restricted to such basic chemical or physical properties. We followed the extracting process in (...truncated)