PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions

Dong et al. J Cheminform (2018) 10:16
https://doi.org/10.1186/s13321-018-0270-2

Open Access SOFTWARE

Jie Dong1,2, Zhi‑Jiang Yao1, Lin Zhang2, Feijun Luo2, Qinlu Lin2, Ai‑Ping Lu3, Alex F. Chen4 and Dong‑Sheng Cao1,3,4*

Abstract

Background: With the rapid development of biotechnology and informatics, publicly available data in chemistry and biology are growing explosively. The wealth of information in these data needs to be extracted and transformed into useful knowledge by data mining methods. Given the rate at which data accumulate in chemistry and biology, new tools that process and interpret large and complex interaction data are increasingly important. So far, no suitable toolkit effectively links the chemical and biological spaces in terms of molecular representation. To explore these complex data further, an integrated toolkit for molecular representation is urgently needed, one that can be easily combined with data mining algorithms to start a full data analysis pipeline.

Results: Herein, the Python library PyBioMed is presented. It comprises functionality for downloading various molecular objects online by their IDs, pretreating molecular structures, and computing molecular descriptors for chemicals, proteins, DNAs and their interactions. PyBioMed is a feature-rich and highly customizable Python library for characterizing complex chemical and biological molecules and interaction samples. The current version of PyBioMed can calculate 775 chemical descriptors and 19 kinds of chemical fingerprints, 9920 protein descriptors based on protein sequences, more than 6000 DNA descriptors from nucleotide sequences, and interaction descriptors from pairwise samples using three different combining strategies.
Several examples and five real-life applications are provided to show users how to use PyBioMed as an integral part of data analysis projects. With PyBioMed, users can conveniently run a full pipeline from obtaining molecular data and pretreating molecules, through molecular representation, to constructing machine learning models.

Conclusion: PyBioMed provides user-friendly and highly customizable APIs for conveniently calculating features of biological molecules and complex interaction samples, with the aim of building integrated analysis pipelines from data acquisition, data checking, and descriptor calculation to modeling. PyBioMed is freely available at http://projects.scbdd.com/pybiomed.html.

Keywords: Molecular representation, Molecular descriptors, Python library, Chemoinformatics, Data integration, Bioinformatics

*Correspondence: oriental‑
1 Xiangya School of Pharmaceutical Sciences, Central South University, No. 172, Tongzipo Road, Yuelu District, Changsha, People’s Republic of China
Full list of author information is available at the end of the article

© The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

With the rapid development of biotechnology and informatics, the past decade has seen exceptional growth in publicly available data in chemistry and biology, especially in human-specific molecular interaction data.
The heterogeneity of data in databases poses a significant challenge to their integration and analysis in practice [1, 2]. However, the bioinformatics and cheminformatics communities have evolved more or less independently, with an emphasis on biological macromolecules and chemical compounds, respectively. Investigating interactions means studying a complex molecular recognition process. It is related not only to bioinformatics projects that aim at a systematic, genome-scale analysis of the structure and function of proteins and DNAs, but also to chemoinformatics projects devoted to analyzing the structure and biological activity of chemicals. More importantly, knowledge generated in both the chemical and the biological spaces must be investigated systematically, especially when identifying new targets and their potential ligands, discovering potential biomarkers for complex diseases, understanding the mechanisms of interactions, and discovering new regulatory mechanisms [3–8]. It is therefore necessary to build informatics platforms for unified data or knowledge representation that can integrate the existing efforts of both communities.

Furthermore, the wealth of information in these data needs to be extracted and then transformed into useful knowledge by data mining and artificial intelligence methods. Many machine learning methods have been developed to mine useful biomedical information [9–16]. To apply machine learning approaches to molecular data, however, it is common practice to encode molecular information as numerical features. The type of encoding can significantly affect the analysis, so choosing a precise and effective encoding is a critical step.
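To make the idea of encoding molecular information as numerical features concrete, the following is a minimal sketch in plain Python. It does not use PyBioMed and is not the library's API: the function names `amino_acid_composition` and `kmer_frequencies`, the example sequences, and the choice of these two simple encodings are all assumptions made for illustration. The sketch turns a protein sequence into a 20-element composition vector and a DNA sequence into a vector of overlapping k-mer frequencies.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order so every vector is comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Hypothetical helper: fraction of each standard residue in a protein sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def kmer_frequencies(dna, k=2):
    """Hypothetical helper: normalized overlapping k-mer frequencies over A, C, G, T."""
    seq = dna.upper()
    # All 4**k possible k-mers in lexicographic order: AA, AC, AG, AT, CA, ...
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


# Made-up example sequences, for illustration only.
protein_vector = amino_acid_composition("MKVLAAGIV")
dna_vector = kmer_frequencies("ACGTACGT", k=2)

print(len(protein_vector), len(dna_vector))
```

Both functions return fixed-length vectors regardless of sequence length, which is what lets sequences of different sizes be fed to the same machine learning model. Different encodings (for example, a larger `k`) keep different information, which is why the choice of encoding matters.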
Molecular descriptors are one of the most powerful approaches to characterizing the biological, physical, and chemical properties of molecules and have long been used in studies of molecular interactions and in drug discovery. These descriptors capture and magnify distinct aspects of molecular topology in order to investigate how molecular structure affects molecular properties. Molecular features have frequently been used to develop machine learning models for QSAR/QSPR [17, 18], virtual screening [19], similarity search [20], drug absorption, distribution, metabolism, elimination and toxicity (ADMET) evaluation [21–24], protein structural and functional classes [25, 26], protein–protein interactions [27], compound–protein interactions [28–31], subcellular locations and peptides of specific properties [32], meiotic recombination hot spots [33], nucleosome positioning in genomes and other drug discovery processes [34]. Given the importance of molecular representation, some web servers and stand-alone programs, such as RDKit [35], CDK [36], rcdk, PaDEL [37], Cinfony [38], Chemopy [39], ChemDes [40], BioJava [41], BioTriangle [42], bioclipse [43], pr (...truncated)