PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions
Dong et al. J Cheminform (2018) 10:16
https://doi.org/10.1186/s13321-018-0270-2
Open Access
SOFTWARE
Jie Dong1,2, Zhi‑Jiang Yao1, Lin Zhang2, Feijun Luo2, Qinlu Lin2, Ai‑Ping Lu3, Alex F. Chen4
and Dong‑Sheng Cao1,3,4*
Abstract
Background: With the rapid development of biotechnology and informatics, publicly available
data in chemistry and biology are growing explosively. The rich information in these data needs to be
extracted and transformed into useful knowledge by data mining methods. Given the rate at
which data accumulate in chemistry and biology, new tools that process and interpret large and complex
interaction data are increasingly important. So far, no suitable toolkit effectively links the chemical
and biological spaces from the viewpoint of molecular representation. To further explore these complex data, an integrated toolkit
for various molecular representations is urgently needed, one that can be easily combined with data mining algorithms
to start a full data analysis pipeline.
Results: Herein, the Python library PyBioMed is presented. It comprises functionality for downloading
various molecular objects online by their IDs, pretreating molecular structures, and computing
various molecular descriptors for chemicals, proteins, DNAs and their interactions. PyBioMed is a feature-rich and
highly customizable Python library for characterizing complex chemical and biological molecules
and interaction samples. The current version of PyBioMed can calculate 775 chemical descriptors and 19 kinds of
chemical fingerprints, 9920 protein descriptors based on protein sequences, more than 6000 DNA descriptors from
nucleotide sequences, and interaction descriptors from pairwise samples using three different combining strategies.
Several examples and five real-life applications are provided to show users how to use PyBioMed as an
integral part of data analysis projects. With PyBioMed, users can conveniently run a full pipeline from obtaining molecular
data and pretreating molecules through molecular representation to constructing machine learning models.
Conclusion: PyBioMed provides user-friendly and highly customizable APIs for conveniently calculating features of biological
molecules and complex interaction samples, with the aim of building integrated analysis pipelines
from data acquisition, data checking, and descriptor calculation to modeling. PyBioMed is freely available at
http://projects.scbdd.com/pybiomed.html.
Keywords: Molecular representation, Molecular descriptors, Python library, Chemoinformatics, Data integration,
Bioinformatics
*Correspondence: oriental‑
1 Xiangya School of Pharmaceutical Sciences, Central South University,
No. 172, Tongzipo Road, Yuelu District, Changsha, People’s Republic of China
Full list of author information is available at the end of the article
© The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium,
provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license,
and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/
publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Background
With the increasing development of biotechnology and
informatics technology, the past decade has seen an
exceptional growth in publicly available data in chemistry and biology, especially in human-specific molecular
interaction data. The heterogeneity of data in databases
poses a significant challenge to their integration and analysis in practice [1, 2]. However, the bioinformatics and
the chemoinformatics communities have evolved more
or less independently, with an emphasis on biological
macromolecules and chemical compounds, respectively.
Investigating molecular interactions, which arise from a complex molecular recognition process, relates not only to
bioinformatics projects that aim at a systematic analysis of the
structure and function of proteins and DNAs at the genome level,
but also to chemoinformatics
projects devoted to the analysis of the structure and
biological activity of chemicals. More importantly, systematic investigation of the knowledge generated in both the
chemical and biological spaces is required,
especially when identifying new targets
and their potential ligands, discovering potential biomarkers for complex diseases, understanding the mechanisms of interaction, and discovering new regulatory
mechanisms [3–8]. Therefore, it is necessary to
build informatics platforms for unified data and knowledge
representation that integrate the existing efforts of
both communities.
Furthermore, the rich information in these data
needs to be extracted and then transformed into useful
knowledge by data mining and artificial intelligence methods. Many machine learning methods have
been developed to mine useful biomedical
information [9–16]. To apply
machine learning approaches to molecular data, it is
common practice to encode molecular information as
numerical features. The type of encoding, however, can
significantly affect analyses, and choosing a precise and
effective encoding is a critical step. Molecular descriptors are one of the most powerful approaches to characterizing the biological, physical, and chemical properties of
molecules and have long been used in studies aimed at
understanding molecular interactions or drug discovery.
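As a minimal illustration of encoding molecular information as numerical features, the sketch below computes the amino acid composition of a protein sequence, i.e. the fraction of each of the 20 standard residues, using only the Python standard library. The function name `amino_acid_composition`, the constant `AMINO_ACIDS`, and the example sequence are hypothetical and chosen for illustration; this is not PyBioMed's API.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order, so that every sequence
# maps to a feature vector with the same layout (illustrative constant).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`.

    Hypothetical helper for illustration only. Non-standard characters
    are ignored when computing the fractions.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Example: a short, made-up peptide encoded as a 20-dimensional vector.
features = amino_acid_composition("MKTAYIAKQR")
```

The resulting vector has a fixed length of 20 regardless of sequence length, which is what allows it to be passed to a machine learning model; richer descriptors follow the same pattern of mapping a molecule to a numeric vector.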
These descriptors capture and magnify distinct aspects of
molecular topology in order to investigate how molecular
structure affects molecular properties. Molecular features
have frequently been used to develop machine
learning models in QSAR/QSPR [17, 18], virtual screening [19],
similarity search [20], drug absorption, distribution,
metabolism, elimination and toxicity (ADMET) evaluation [21–24], protein structural and functional classes
[25, 26], protein–protein interactions [27], compound–
protein interactions [28–31], subcellular locations and
peptides of specific properties [32], meiotic recombination hot spots [33], nucleosome positioning in genomes
and other drug discovery processes [34]. Given the importance of
molecular representation, some web servers
and stand-alone programs, such as RDKit [35], CDK [36],
rcdk, PaDEL [37], Cinfony [38], Chemopy [39], ChemDes
[40], BioJava [41], BioTriangle [42], bioclipse [43], pr (...truncated)