Predicting Chemical Toxicity Effects Based on Chemical-Chemical Interactions
Citation: Chen L, Lu J, Zhang J, Feng K-R, Zheng M-Y, et al. (
Lei Chen, Jing Lu, Jian Zhang, Kai-Rui Feng, Ming-Yue Zheng, Yu-Dong Cai
Gajendra P. S. Raghava, CSIR-Institute of Microbial Technology, India
1 Institute of Systems Biology, Shanghai University, Shanghai, China, 2 College of Information Engineering, Shanghai Maritime University, Shanghai, China, 3 Drug Discovery and Design Center (DDDC), Shanghai Institute of Materia Medica, Shanghai, China, 4 Department of Ophthalmology, Shanghai First People's Hospital Affiliated to Shanghai Jiaotong University, Shanghai, China, 5 Simcyp Limited, Blades Enterprise Centre, Sheffield, United Kingdom
Toxicity is a major contributor to the high attrition rates of new chemical entities in drug discovery. In this study, an order classifier was built to predict a series of toxic effects based on data concerning chemical-chemical interactions, under the assumption that interactive compounds are more likely to share similar toxicity profiles. According to their interaction confidence scores, the order from the most likely toxicity to the least likely was obtained for each compound. Ten test groups, each containing one training dataset and one test dataset, were constructed from a benchmark dataset consisting of 17,233 compounds. By a Jackknife test on each of these test groups, the 1st order prediction accuracies of the training dataset and the test dataset were all approximately 79.50%, substantially higher than the rate of 25.43% achieved by random guesses. Encouraged by these promising results, we expect that our method will become a useful tool in screening out drugs with high toxicity.
Funding: This work was supported by National Basic Research Program of China (2011CB510102, 2011CB510101), National Natural Science Foundation of China
(61202021, 81001399), National S&T Major Project (2012ZX09301-001-002), Hi-TECH Research and Development Program of China (2012AA020308), Innovation
Program of Shanghai Municipal Education Commission (12YZ120, 12ZZ087), Shanghai Educational Development Foundation (12CG55), and Science and
Technology Program of Shanghai Maritime University (No. 20120105). The funders had no role in study design, data collection and analysis, decision to publish, or
preparation of the manuscript.
Competing Interests: One co-author, Dr. Kai-Rui Feng, is employed by a commercial company, Simcyp Limited, but he contributed to the study in his spare
time. This project does not receive any grants or financial support from Simcyp Limited. Also, Simcyp Limited has no influence on any aspects of this research. This
does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.
These authors contributed equally to this work.
Toxicity is a key cause of late-stage failures in drug discovery.
Even some approved drugs, such as phenacetin and
troglitazone, have been withdrawn from the market because
of unexpected toxicities that were not detected during Phase III
clinical trials. Thus, early toxicology data on compounds are
needed to reduce R&D costs. Evaluating toxicity and assessing
risks of diverse chemicals require comprehensive experimental
testing against a broad spectrum of toxicity end points. These tests
can cost millions of dollars, involving several thousand animals,
and take many years to complete. As a result, very few chemicals
have undergone the degree of testing needed to support accurate
health risk assessments or meet regulatory requirements for drug
approval. In recent years, the number of synthetic compounds has
surged with the advance of combinatorial chemistry, and
accordingly large quantities of toxicity data are urgently needed.
Recently, there has been particular interest in applying fast and
cost-effective in silico toxicological models to supplement
in vitro and in vivo testing. These models require high-quality
toxicity data for a large set of structurally diverse drug candidates.
Accelrys Toxicity is a database of toxicity information compiled
from the open scientific literature  and containing toxicological
data for approximately 0.17 million chemicals. This database is of
great value for investigating the pharmacokinetic properties,
metabolism and potential toxicities of compounds. Six types of
toxicity data are collected in the database: (1) Acute Toxicity; (2)
Mutagenicity; (3) Tumorigenicity; (4) Skin and Eye Irritation; (5)
Reproductive Effects; and (6) Multiple Dose Effects. It should be
noted that these categories have multiple and overlapping
mechanisms of toxic action and each category represents only
specific types of experiments. The combination of these
experimental results may help define the overall safety profile of a
compound. However, databases of this kind provide
toxicological information only for recorded compounds, not for new
ones. It would be valuable to accurately predict the toxicities of a new
compound based on the information available for recorded
compounds. To meet this demand, there is a drive to
develop quick, reliable prediction methods that do not involve
animals, e.g. using structure-activity relationships (SARs) to
predict drug toxicities.
Currently, most toxicological SAR models are binary
classifiers, which only predict whether compounds are toxic or non-toxic
within a single toxicity class [4,5]. It is desirable to extend this
strategy to predict a series of toxicity effects. In this study, we chose
to build a multiclass model [6,7] to predict six categories of toxicity
using the Accelrys Toxicity database instead of only one or two
toxicity endpoints. However, the quadratic optimization problem
in multiclass models is difficult to solve. Thus, many previous
multiclass approaches tended to decompose a multiclass problem
into multiple independent binary classifications. Investigators built
a set of binary classifiers, such as the model of Dietterich et al.,
each classifier distinguishing only one of the classes from the
others. Although this greatly simplifies the problem, such an
approach cannot provide order prediction information for the
query compounds. That is, it can only predict whether the query
compound has some toxicity end points, but cannot determine
which is the most likely toxicity, or even the order of toxicity end
points by toxicity likelihoods.
In recent years, the assessment of protein-protein interactions
has been widely used to predict many attributes of proteins
[8,9,10,11]. Furthermore, multiclass predictions of protein
attributes have become more common [12,13,14]. These methods and
their results show that interactive proteins tend to share the same
functions with higher probability than do non-interactive ones.
Likewise, it is reasonable to expect that interactive compounds are
also more likely to share common functions as indicated by some
pioneering studies [15,16]. Thus, toxicity, as part of the biological
functions of compounds, should follow the same rule. Moreover,
a previous work on the Anatomical Therapeutic
Chemical (ATC) classification of drugs suggests that, compared to
SAR models based on physicochemical descriptors or structural
alerts, a model based on chemical-chemical interactions can rank
the order of the predictions more easily and yield better prediction
results. In our study, we quantified chemical-chemical
interactions for each pair of interactive compounds and obtained the
confidence scores of the interactions, by which the toxicity end
points were ordered. Briefly, compounds of seven categories,
comprising six categories of toxicity plus non-toxicity, were collected.
The interactive compounds of each query compound were
identified using STITCH (Search Tool for Interactions of
Chemicals) [17,18]. Then, the score of each class for the query
compound was obtained from the confidence scores of the interactions
between the query compound and its interactive compounds, using
the toxicity profiles of the interactive compounds. Finally, the
prediction quality of the model was evaluated using the Jackknife
test on ten test groups. Each of these was constructed from
the benchmark dataset and contained one training dataset and one
external test dataset. Details are described in the following
sections.
Materials and Methods
We obtained a total of 171,266 compounds from the Accelrys
Toxicity Database 2011.4, each of which had at least one toxicity
effect belonging to the following six categories: (1) Acute Toxicity;
(2) Mutagenicity; (3) Tumorigenicity; (4) Skin and Eye Irritation;
(5) Reproductive Effects; (6) Multiple Dose Effects. Based on
their toxicity, these compounds were allocated to the six
categories, allowing multiple assignments. In addition, 2,871
non-toxic compounds, including FDA-approved drugs from DrugBank
and endogenous metabolites from the Human Metabolome
Database (HMDB), were collected and labeled as a negative
class. For convenience, the non-toxic set is regarded as the 7th
category of compound toxicity. Due to the lack of chemical-chemical
interaction information in STITCH [17,18], some compounds
cannot be investigated by this approach. After excluding these
compounds, a benchmark dataset S consisting of 17,233
compounds was retrieved, of which 16,587 were toxic and 646
were non-toxic. These compounds are classified into 7 categories
of compound toxicity. Shown in Table 1 is the distribution of
compounds in each category. The codes of 17,233 compounds and
their toxicity information can be found in Table S1.
It is observed from Table 1 that the sum of the number of
compounds in all the 7 categories is much larger than the number
of compounds, indicating that some compounds are allocated to
more than one category of toxicity. Of the 17,233 compounds in
the benchmark dataset, 10,151 compounds belong to only one
category of toxicity, 3,475 compounds belong to two categories of
toxicity, and the others belong to three to five categories of toxicity; no
compound belongs to more than five categories of toxicity (refer
to Figure 1 for a plot of the number of compounds against the
number of categories of toxicity). Thus, prediction of compound
toxicity is a multi-label classification problem. Like the case of
processing proteins or compounds with multiple attributes
[15,16,22], the proposed method would provide a series of
candidate toxicities, ranging from the most to the least likely,
instead of presenting only the most likely one.
To sufficiently evaluate the prediction method described in the
following section, we constructed 10 test groups, denoted by
$TG_1, TG_2, \ldots, TG_{10}$, respectively. In each test group
$TG_i$ $(1 \le i \le 10)$, there is one training dataset $S_{tr}^{(i)}$ and one test
dataset $S_{te}^{(i)}$, i.e., $TG_i = \langle S_{tr}^{(i)}, S_{te}^{(i)} \rangle$, where the test dataset consisted
of 1,723 compounds which were randomly selected from $S$, while
the training dataset contained the remaining 15,510 samples in $S$,
i.e., $S = S_{tr}^{(i)} \cup S_{te}^{(i)}$ for each $1 \le i \le 10$. It is necessary to point out
that, in each test group, the proportion of the data in each class of the
test dataset is roughly the same as that of the training dataset.
Shown in Table 2 is the distribution of compounds in training
and test datasets of each test group.
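The construction of the ten test groups can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function name `make_test_groups`, the seed, and the placeholder compound IDs are hypothetical, and the sketch draws the 1,723 test compounds by plain random sampling, relying on the size of the dataset to keep class proportions roughly equal rather than enforcing them.

```python
import random

def make_test_groups(compound_ids, n_groups=10, test_size=1723, seed=0):
    """Return a list of (training_ids, test_ids) pairs, one per test group."""
    rng = random.Random(seed)
    ids = sorted(compound_ids)  # fixed order so the seed fully determines the split
    groups = []
    for _ in range(n_groups):
        test_ids = set(rng.sample(ids, test_size))
        train_ids = [c for c in ids if c not in test_ids]
        groups.append((train_ids, sorted(test_ids)))
    return groups

# Placeholder IDs standing in for the 17,233 benchmark compounds (made up).
benchmark = [f"C{i:05d}" for i in range(17233)]
test_groups = make_test_groups(benchmark)
train_1, test_1 = test_groups[0]
print(len(train_1), len(test_1))  # 15510 1723
```

Each group is drawn independently from the full benchmark set, so a compound may appear in the test dataset of more than one group.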
It is known that two proteins that can interact with each other
are more likely to share common biological functions than
noninteractive ones [8,9,10,11]. Likewise, two interactive compounds
are also more likely to share similar biological functions [15,16].
Since toxicity is one of a compound's properties and functions,
utilizing chemical-chemical interactions to identify compound
toxicity is deemed to be feasible.
The data for chemical-chemical interactions were retrieved
from STITCH (chemical_chemical.links.detailed.v3.0.tsv.gz,
http://stitch.embl.de/cgi/show_download_page.), a
well-known database including known and predicted interactions of
chemicals and proteins collected from experiments, literature or
other reliable sources. In the obtained file, each interaction unit
contains two compounds and five kinds of scores titled
"Similarity", "Experimental", "Database", "Textmining" and
"Combined_score". The last score was used here to
indicate the interactivity of two compounds, i.e., two compounds
with a "Combined_score" greater than zero were deemed
interactive compounds, because this score integrates the
information of the other kinds of scores. Thus, the considered
interactive compounds in this study contain the following three
categories: (1) those participating in the same reactions; (2) those
sharing similar structures or activities and (3) those with literature
associations . It is known that these categories correspond to
the following three facts: (I) compounds involved in the same
reactions occupy the same biological pathways; (II) compounds
with similar structures or activities are likely to share similar
functions, thereby occupying the same pathways with high
probability; (III) the co-occurrence of two compounds, as noted
in many studies, indicates some direct or indirect relationships,
suggesting that they have the potential to share the same pathways.
On the other hand, compounds in the same biological pathways
always induce similar side effects, thereby having similar toxicity
effects. Accordingly, it is reasonable to suppose that interactive
compounds tend to have similar toxicity effects.
The value of the Combined_score of two interactive
compounds indicates the likelihood that they can interact, i.e.,
two interactive compounds with high Combined_score can
interact with high probability. Thus, this score is also termed a
confidence score in this study. For two compounds c1 and c2, let us
denote the confidence score of an interaction between them by
Q(c1,c2). Specifically, if there is no interaction information between
c1 and c2 based on the current records in STITCH, their
interaction confidence score is assigned zero, i.e., Q(c1,c2) = 0. In
this study, 323,432 interaction units, i.e., 323,432 pairs of
compounds with confidence scores greater than 0, were used to
predict compound toxicity. The detailed information on these
interaction units can be found in Table S2.
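The lookup of confidence scores described above can be sketched as follows. This is an illustration only: the column names `chemical1`, `chemical2` and `combined_score`, the tab-separated layout, and the sample rows are assumptions made for the example and should be checked against the actual STITCH file. The helper names `load_confidence_scores` and `Q` are likewise hypothetical.

```python
import csv
import io

def load_confidence_scores(handle):
    """Read interaction units and return {frozenset({c1, c2}): combined_score}."""
    scores = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        score = int(row["combined_score"])
        if score > 0:  # only pairs with a positive combined score count as interactive
            scores[frozenset((row["chemical1"], row["chemical2"]))] = score
    return scores

def Q(scores, c1, c2):
    """Confidence score of the interaction between c1 and c2 (0 if none is recorded)."""
    return scores.get(frozenset((c1, c2)), 0)

# Made-up rows in the assumed layout.
sample = io.StringIO(
    "chemical1\tchemical2\tsimilarity\texperimental\tdatabase\ttextmining\tcombined_score\n"
    "CID_A\tCID_B\t0\t0\t900\t250\t920\n"
    "CID_A\tCID_C\t310\t0\t0\t0\t310\n"
)
scores = load_confidence_scores(sample)
print(Q(scores, "CID_B", "CID_A"), Q(scores, "CID_B", "CID_C"))  # 920 0
```

Keying the dictionary on an unordered pair makes the score symmetric, so the order in which two compounds are given does not matter.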
As mentioned in the above section, interactive compounds are
more likely to have common toxicity. Accordingly, the toxicities of
a query compound can be identified according to its interactive
compounds.
For convenience, let $T_1, T_2, \ldots, T_7$ denote the seven categories
of toxicity, where $T_1$ denotes Acute Toxicity, $T_2$ denotes
Mutagenicity, and so forth (see columns 1 and 2 of Table 1). Suppose that
there are $n$ compounds in the training dataset, that is, $c_1, c_2, \ldots, c_n$.
The toxicity of a compound $c_i$ in the training dataset is formulated
as
$$T(c_i) = [t_{i,1}, t_{i,2}, \ldots, t_{i,7}] \qquad (i = 1, 2, \ldots, n) \tag{1}$$
where
$$t_{i,j} = \begin{cases} 1 & \text{if } c_i \text{ has toxicity } T_j \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
Given a query compound $c_q$, its toxicity is predicted not only by its
interactive compounds but also by the confidence scores of their
interactions. The score indicating that the query compound $c_q$ has
toxicity $T_j$ is calculated by
$$H(c_q \Rightarrow T_j) = \sum_{i=1}^{n} Q(c_i, c_q) \cdot t_{i,j} \qquad (j = 1, 2, \ldots, 7) \tag{3}$$
A high score $H(c_q \Rightarrow T_j)$ means that there are many
interactive compounds of $c_q$ in the training dataset that have
toxicity $T_j$, or that some interactions between $c_q$ and its interactive
compounds having toxicity $T_j$ carry high confidence
scores. In view of this, the greater the score $H(c_q \Rightarrow T_j)$, the more
likely it is that the compound $c_q$ has toxicity $T_j$. In particular, if
$H(c_q \Rightarrow T_j) = 0$ for some $j$, the probability that the
query $c_q$ has the $j$-th category of toxicity is taken to be zero because there
are no interactive compounds of $c_q$ in the training dataset that have
toxicity $T_j$.
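The score of Eq. 3 is a confidence-weighted vote over the training compounds, which can be sketched as follows. The function name `toxicity_scores`, the compound names and the confidence values are made up for illustration; the sketch assumes that labels are stored as 0/1 vectors of length seven and that a function `Q` returns 0 for pairs with no recorded interaction.

```python
def toxicity_scores(query, training_labels, Q, n_classes=7):
    """Score of the query for each class: sum of Q(c_i, query) * t_ij over training compounds."""
    H = [0] * n_classes
    for compound, labels in training_labels.items():
        q = Q(compound, query)
        if q == 0:
            continue  # non-interactive compounds contribute nothing
        for j in range(n_classes):
            H[j] += q * labels[j]
    return H

# Toy training set: 0/1 label vectors over (T1, ..., T7); all values are made up.
training_labels = {
    "a": [1, 0, 1, 0, 0, 0, 0],
    "b": [0, 0, 1, 0, 0, 1, 0],
    "c": [0, 1, 0, 0, 0, 0, 0],
}
confidence = {frozenset(("q", "a")): 700, frozenset(("q", "b")): 400}

def Q(c1, c2):
    return confidence.get(frozenset((c1, c2)), 0)

H = toxicity_scores("q", training_labels, Q)
print(H)  # [700, 0, 1100, 0, 0, 400, 0]
```

In this toy case compound "c" has no interaction with the query, so its label (T2) receives a score of zero and is not a candidate toxicity.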
Since this is a multi-label classification problem, i.e., some
compounds have more than one category of toxicity, a prediction
method providing only the most likely toxicity is not an optimal
choice. Thus, our method is valuable in that it can provide a series
of candidate toxicities for a query compound, ranging from the
most likely to the least likely. For example, if the results obtained
from Eq. 3 are
$$H(c_q \Rightarrow T_3) > H(c_q \Rightarrow T_1) > H(c_q \Rightarrow T_6) > 0, \qquad H(c_q \Rightarrow T_j) = 0 \ \text{for } j \in \{2, 4, 5, 7\} \tag{4}$$
it can be interpreted to mean that there are three candidate
toxicities for the query compound $c_q$, and the most likely toxicity
for $c_q$ is $T_3$ (Tumorigenicity, cf. Table 1), followed by $T_1$
(Acute Toxicity) and $T_6$ (Multiple Dose Effects). In addition,
$T_3$ is called the 1st order prediction, $T_1$ the 2nd order prediction,
and so forth.
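Turning a score vector into an ordered list of candidate toxicities can be sketched as follows. The function name `order_predictions` and the score values are hypothetical. Classes with a zero score are dropped, and ties are broken by class index, which is an assumption of this sketch because the text does not say how ties are handled.

```python
def order_predictions(H):
    """Return 1-based class indices with a positive score, highest score first."""
    candidates = [j for j, h in enumerate(H, start=1) if h > 0]
    # sorted() is stable, so equal scores keep their original (class index) order.
    return sorted(candidates, key=lambda j: H[j - 1], reverse=True)

H = [700, 0, 1100, 0, 0, 400, 0]  # made-up scores for T1..T7
print(order_predictions(H))  # [3, 1, 6]
```

With these made-up scores the 1st order prediction is T3, the 2nd is T1 and the 3rd is T6, mirroring the example in the text.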
The Jackknife test is often used to examine the
performance of various predictors, because it always provides
a unique prediction result for a given dataset. It has been widely
used by investigators to evaluate their predictors
[23,24,25,26,27,28,29,30,31,32,33]. During the test, each sample
in the training dataset is singled out one-by-one and tested by the
predictor trained on the other samples. Thus, each sample is tested
exactly once.
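For an interaction-based predictor, the Jackknife test amounts to scoring each compound from all the other compounds in the same dataset. The following is a minimal sketch under that reading: the function name `jackknife_predictions`, the compound names, the labels and the confidence values are all made up, and ties are again broken by class index as an assumption.

```python
def jackknife_predictions(labels, Q, n_classes=7):
    """Leave-one-out: rank the classes of each compound using all the other compounds."""
    results = {}
    for query in labels:
        H = [0] * n_classes
        for other, vec in labels.items():
            if other == query:
                continue  # the held-out sample never votes for itself
            q = Q(other, query)
            for j in range(n_classes):
                H[j] += q * vec[j]
        ranked = sorted((j for j in range(1, n_classes + 1) if H[j - 1] > 0),
                        key=lambda j: H[j - 1], reverse=True)
        results[query] = ranked
    return results

# Toy dataset with made-up labels over (T1, ..., T7) and made-up confidence scores.
labels = {
    "a": [1, 0, 0, 0, 0, 0, 0],
    "b": [1, 0, 1, 0, 0, 0, 0],
    "c": [0, 0, 1, 0, 0, 0, 0],
    "d": [0, 0, 0, 0, 0, 0, 1],
}
confidence = {frozenset(("a", "b")): 900, frozenset(("b", "c")): 500,
              frozenset(("a", "c")): 200}

def Q(c1, c2):
    return confidence.get(frozenset((c1, c2)), 0)

results = jackknife_predictions(labels, Q)
print(results)
```

Compound "d" has no interactive compounds in this toy set, so it receives an empty list of candidate toxicities.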
The $j$-th order prediction accuracy is calculated by the following
formula:
$$C_j = \frac{CT_j}{N} \qquad (j = 1, 2, \ldots, 7) \tag{5}$$
where $CT_j$ denotes the number of compounds whose $j$-th order
prediction is one of their true toxicities, and $N$ denotes the total
number of compounds in the dataset. If a prediction method
obtains high $C_j$ for small $j$ and low $C_j$ for large $j$, it implies that
the method arranges the candidate toxicities well. Among them,
the 1st order prediction accuracy is the most important indicator of
performance.
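The j-th order prediction accuracy described above (the share of compounds whose j-th ranked prediction is one of their true toxicities) can be sketched as follows. The function name `order_accuracies` and the toy predictions and labels are made up; a compound with fewer than j candidate toxicities simply counts as a miss at order j, which is an assumption of this sketch.

```python
def order_accuracies(predictions, true_labels, n_classes=7):
    """Accuracy at each order j: hits at rank j divided by the number of compounds.

    predictions maps a compound to its ranked list of class indices;
    true_labels maps a compound to the set of its true class indices.
    """
    N = len(true_labels)
    accuracies = []
    for j in range(n_classes):
        hits = sum(1 for c, truth in true_labels.items()
                   if len(predictions[c]) > j and predictions[c][j] in truth)
        accuracies.append(hits / N)
    return accuracies

# Made-up ranked predictions and true label sets for four compounds.
predictions = {"a": [3, 1], "b": [1, 3], "c": [1, 3], "d": []}
true_labels = {"a": {1}, "b": {1, 3}, "c": {3}, "d": {7}}
print(order_accuracies(predictions, true_labels))
# [0.25, 0.75, 0.0, 0.0, 0.0, 0.0, 0.0]
```

A well-ordered predictor would show the opposite pattern to this toy case, with the highest accuracy at the 1st order.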
Although the seven prediction accuracies can be obtained by
Eq. 5, none of them provides the overall prediction accuracy. In
view of this, we employ another measurement that calculates the
proportion of true toxicities covered by the first $m$ predictions. It can be
calculated as follows:
$$D_m = \frac{\sum_{i=1}^{N} S_{i,m}}{\sum_{i=1}^{N} N_i} \tag{6}$$
where $S_{i,m}$ represents the number of correct predictions of the
$i$-th compound among its first $m$ predictions, and $N_i$ represents the
number of toxicities that the $i$-th compound has. Since different
compounds may have different numbers of toxicities, the
parameter $m$ in Eq. 6 is usually taken as the smallest integer no
less than the average number of toxicities in the dataset, which can
be computed by
$$M = \frac{1}{N} \sum_{i=1}^{N} N_i \tag{7}$$
so that $m = \lceil M \rceil$. Obviously, a larger $D_m$ implies better prediction
performance by the method for the identification of compound
toxicity.
As described in the Section "Benchmark dataset", 10 test
groups were constructed to evaluate the method described in
Section "Prediction method". In each test group, there were one
training dataset consisting of 15,510 compounds and one test
dataset containing 1,723 compounds. The predicted results for
each test group obtained by the proposed method are as follows.
Performance of the Method on the Training Dataset
For the 15,510 compounds in each training dataset
$S_{tr}^{(i)}$ $(1 \le i \le 10)$, we conducted the prediction and evaluated its
performance by the Jackknife test. Listed in the column titled
$S_{tr}^{(i)}$ of Table 3 are seven prediction accuracies, calculated by Eq.
5, for training dataset $S_{tr}^{(i)}$, from which we can see that the 1st order
prediction accuracies were all around 79.50%, where the
maximum was 79.57%, while the minimum was 79.23%; the
2nd order ones were all around 37.30%. This indicates that the
proposed method is very stable. It is also observed from the
corresponding columns of Table 3 that the accuracies followed a
descending trend with increasing order number, indicating
that the method sorted the candidate toxicities quite well for the
compounds in each training dataset $S_{tr}^{(i)}$ $(1 \le i \le 10)$. The average
numbers of toxicities for compounds in each training dataset $S_{tr}^{(i)}$
were about 1.78 according to Eq. 7, i.e., $M = 1.78$. It is noteworthy
that if one predicts compound toxicity by random guesses, the
average success rate would be only 25.43% (1.78/7), which is
much lower than each of the 1st order prediction accuracies of our
method. To evaluate the prediction accuracy of the method more
thoroughly, Eq. 6 was calculated by taking $m = 2$, i.e., we
considered the first two predictions for each compound in
$S_{tr}^{(i)}$ $(1 \le i \le 10)$ to see the proportions of true toxicities covered
by these predictions. These proportions are shown in column 2 of
Table 4, from which we can see that they were all about 65.50%,
where the maximum was 65.61% while the minimum was
65.32%. Thus, it is indicated once again that our method is
stable.
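The overall measure of Eq. 6, the average number of toxicities of Eq. 7 and the random-guess baseline can be sketched as follows. This sketch assumes that Eq. 6 is the total number of true toxicities found among the first m predictions divided by the total number of true toxicities; the function names and the toy data are made up.

```python
import math

def average_label_count(true_labels):
    """M: average number of true toxicities per compound."""
    return sum(len(t) for t in true_labels.values()) / len(true_labels)

def coverage_of_first_m(predictions, true_labels, m):
    """Share of all true toxicities that appear among the first m predictions."""
    covered = sum(len(set(predictions[c][:m]) & truth)
                  for c, truth in true_labels.items())
    total = sum(len(truth) for truth in true_labels.values())
    return covered / total

# Made-up ranked predictions and true label sets for four compounds.
predictions = {"a": [3, 1], "b": [1, 3], "c": [1, 3], "d": []}
true_labels = {"a": {1}, "b": {1, 3}, "c": {3}, "d": {7}}

M = average_label_count(true_labels)
m = math.ceil(M)  # smallest integer no less than M
print(M, m, coverage_of_first_m(predictions, true_labels, m))  # 1.25 2 0.8

# Random-guess baseline quoted in the text: 1.78 true toxicities out of 7 classes.
print(round(100 * 1.78 / 7, 2))  # 25.43
```

The baseline follows from the fact that a uniformly random guess among seven categories hits one of a compound's true toxicities with probability equal to its number of toxicities divided by seven.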
Performance of the Method on the Test Dataset
For the 1,723 compounds in each test dataset $S_{te}^{(i)}$ $(1 \le i \le 10)$,
the toxicities of these compounds were predicted by the proposed
method described in Section "Prediction method" based on the
compounds in the training dataset $S_{tr}^{(i)}$. After processing by Eq. 5,
seven prediction accuracies for each test dataset $S_{te}^{(i)}$ were obtained
and are listed in the column titled $S_{te}^{(i)}$ of Table 3. It is
observed that the 1st order prediction accuracies were all about
79.50%. Similar to the seven prediction accuracies for each
training dataset $S_{tr}^{(i)}$, those of test dataset $S_{te}^{(i)}$ also followed a
descending trend with the increase of the order number, implying
that our method also arranged the candidate toxicities of samples
in each test dataset quite well. According to Eq. 7, the average
numbers of toxicities for the compounds in each test dataset were
about 1.80. Thus, we still considered the first two predictions of
each sample in $S_{te}^{(i)}$ $(1 \le i \le 10)$ to calculate the proportions of true
toxicities covered by these predictions, i.e., computing Eq. 6 by
taking $m = 2$. Listed in column 3 of Table 4 are ten proportions
for ten test datasets, each yielding a probability of approximately
Understanding of the Toxicity Prediction Results
It is observed from Table 3 that the performance of the method
on ten test groups is similar. Thus, the first test group (i.e., TG1) is
used as an example to show how to interpret the toxicity prediction
results in detail.
Our multiclass model achieved quite promising performance
using the chemical-chemical interaction data on test group TG1
(see Table 3 for details). For example, the compound
NNK shows positive results for five toxicity endpoints: T1, T2, T3,
T5, and T6. Our model accurately predicted these five kinds of
endpoints, and provided the order predictions as T3 > T2 >
T1 > T6 > T5 > T4 > T7. The 7th label, representing non-toxic, was
ranked last, suggesting that this compound is very likely to
have toxic effects. As stated in the Section "Chemical-chemical
interactions", the interactive compounds derived from STITCH
tend to have the same toxicity categories.
4-(Methylnitrosamino)-1-(3-pyridyl)-1-butanol (CID000104856, NNAL), an interactive
compound of NNK, has toxicities T2 and T3, which are also
shared by NNK. The alkyl N-nitroso group (see Figure 2) of these
two compounds is associated with the formation of DNA adducts,
and induces lung cancer in laboratory animals [34,35,36]. Another
example is trimethoprim (CID000005578), which is positive for
five toxicity endpoints: T1, T2, T4, T5, and T6. The prediction
order of our model was T1 > T6 > T2 > T5 > T4 > T3 > T7. This
compound was considered to be a carcinogen according to
chemical-chemical interactions, but the Accelrys Toxicity database
labeled this compound only as a mutagen. However, it is
reasonable to regard this compound as a carcinogen because it
has a genotoxic toxicophore, an aromatic amine (see Figure 2)
[5,37,38]. Typically, mutation is one of the first steps in the
development of cancer.
Figure 3. Nongeneric SAs (Benigni) and some carcinogens matching these SAs.
Tasosartan (CID000060919) is an angiotensin II (AngII)
receptor blocker, which is labeled as a relatively non-toxic
compound in the dataset. Using our model, the order prediction of
this compound was T7 > T1 > T6 > T2. The 1st order prediction is
non-toxic, consistent with the experimental data available.
Among the seven interactive compounds in the training dataset
retrieved from STITCH (see Table 5), the top five interactive
compounds are non-toxic, and their confidence scores are
relatively high. However, the last two interactive compounds are
toxic, so tasosartan is predicted to have some toxicity effects in our
model. Nevertheless, the likelihood of its possessing these toxicities is
lower than that of its being non-toxic.
The predictions for NNK, trimethoprim, and tasosartan and the
prediction accuracies of the method indicate that interactive
compounds can share common toxicity with high probability,
which is consistent with the results of predicting other
attributes of compounds [15,16]. The confidence scores of
chemical-chemical interactions contribute significantly to the
prediction of compound toxicity. As shown in Table 5, the
interactive compounds of tasosartan with high confidence scores
predominantly have the same toxicity as tasosartan. On the other
hand, the predicted results for NNK, trimethoprim, and tasosartan
reflect a limitation of our model: the judgment of toxic or
non-toxic is based on the whole set of compounds with interaction
information. Some of these compounds have low confidence
scores, and they may introduce noisy
interaction information into the final classification model. To
address this issue, future work should introduce a threshold
on the interaction confidence score and exclude noisy
information to obtain a more accurate prediction.
Moreover, many compounds in the original Accelrys Toxicity
database have no chemical-chemical interaction information.
It is expected that the problem of predicting compound toxicity
can be solved more favorably by the method as increasing
amounts of chemical-chemical interaction information become
available.
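The thresholding of confidence scores proposed above as future work could look like the following sketch. It was not part of the evaluated method. The function name `filter_interactions`, the threshold value of 400 and the scores are purely illustrative assumptions, and a suitable cut-off would have to be chosen empirically.

```python
def filter_interactions(scores, threshold):
    """Keep only interaction units whose confidence score reaches the threshold."""
    return {pair: s for pair, s in scores.items() if s >= threshold}

# Made-up confidence scores for three interaction units of a query compound "q".
scores = {
    frozenset(("q", "a")): 920,
    frozenset(("q", "b")): 310,
    frozenset(("q", "c")): 150,
}
kept = filter_interactions(scores, threshold=400)  # hypothetical cut-off
print(len(kept))  # 1
```

Raising the threshold removes low-confidence votes but also increases the number of query compounds left without any interactive compound, so the two effects would need to be balanced.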
Analysis of the Relationship between Different Chemical
In the benchmark dataset, there are 3,607 compounds
with more than two types of toxicity effects and 3,475 compounds
with exactly two effects (refer to Figure 1). We analyzed the
number of common compounds belonging to two categories, and
the ratio of the number of common compounds to the number of
non-overlapping compounds of the two categories (see Table 6).
It can be found that the intersection of T5 (Reproductive Effects,
cf. Table 1) and T6 (Multiple Dose Effects) is the largest,
sharing 26.6% of common compounds. The overlapping
compounds suggest that there may be a causal relationship between
the two categories. Specifically, the reproductive effects may cause
multiple dose effects, i.e., reproductive toxicities may be
cumulative, and hence also be regarded as showing multiple dose effects.
The next largest overlaps between two
categories are T2 (Mutagenicity) vs. T3 (Tumorigenicity) and
T1 (Acute Toxicity) vs. T6 (Multiple Dose Effects). Since, in
many cases, mutation is one of the first steps in the development of
cancer, we took T2 (Mutagenicity) vs. T3
(Tumorigenicity) as an example to study the relationship between the two toxic
effects.
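The pairwise overlap analysis can be sketched as follows. The sketch assumes that "the number of non-overlapping compounds of the two categories" means the number of distinct compounds in either category (the set union); this reading is an assumption and should be checked against Table 6. The function name `pairwise_overlap` and the category memberships are made up.

```python
from itertools import combinations

def pairwise_overlap(members):
    """For each pair of categories, return (common count, common / distinct compounds)."""
    table = {}
    for a, b in combinations(sorted(members), 2):
        common = members[a] & members[b]
        union = members[a] | members[b]  # assumed meaning of "non-overlapping compounds"
        ratio = len(common) / len(union) if union else 0.0
        table[(a, b)] = (len(common), ratio)
    return table

# Made-up category memberships.
members = {
    "T2": {"x", "y", "z"},
    "T3": {"y", "z", "w"},
    "T5": {"v"},
}
overlap = pairwise_overlap(members)
print(overlap[("T2", "T3")])  # (2, 0.5)
```

Under this reading the ratio is the Jaccard index of the two categories, which makes pairs of very different sizes comparable.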
From the viewpoint of mechanism of action, carcinogens can be
classified into genotoxic or epigenetic carcinogens. Genotoxic
carcinogens can bind covalently to DNA, and many known
mutagens belong to this category. In the dataset, there are 1,720
compounds having both toxicity T2
(Mutagenicity) and T3 (Tumorigenicity). The structural alerts (SAs)
provided by Benigni, which are molecular functional groups
associated with a specific toxicity end point, were used here to
gain insights into the correspondence between the two toxic effects. As
summarized in Table S3, we illustrate a few examples for each
of the matched SAs.
As previously mentioned, not all mutagens are
carcinogens. For example, α,β-unsaturated carbonyl compounds
can interact with DNA by Michael addition, leading to
mutagenic and carcinogenic responses, e.g. acrylamide
(CID000006579) and 2-butenal (CID000447466). However, if an
α,β-unsaturated carbonyl compound has conformational
constraints or alkyl groups at the site of nucleophilic attack, the
compound would be prone to react via Schiff base formation.
This change may generate only DNA adducts, without
the subsequent carcinogenic process. This means that
this kind of compound has no carcinogenicity, e.g.
(E)-2-methyl-2-butenal (CID005321950) and 2-propylacrolein (CID000070609).
Epigenetic carcinogens do not usually bind directly to DNA, but
have a large variety of different and specific mechanisms, and
test negative in the standard mutagenicity assay. Thus,
some compounds that match nongeneric SAs are only
carcinogens, not mutagens (see Figure 3).
In this study, a multi-classifier for six toxicity effects was built
based on 17,233 compounds with experimental toxicity
information available and 323,432 pairs of mapped
chemical-chemical interaction information extracted from the STITCH
database. A new chemical entity can have multiple toxicity effects,
so a multiclass toxicity prediction tool may prove to be practically
more valuable to chemists than a traditional binary classification
model. It can provide a better toxicity profile for a compound
rather than merely indicating whether the compound has a
specific toxic action or potential. The outstanding performance of
our approach suggests that the multi-classification scheme is
feasible and effective for in silico chemical toxicity prediction.
Table S1 List of 17,233 compounds investigated in this
study and their toxicity information.
Table S2 List of 323,432 interaction units used to
predict compound toxicity in this study.
Conceived and designed the experiments: LC JZ MYZ YDC. Performed
the experiments: LC JL KRF. Analyzed the data: JL JZ MYZ. Contributed
reagents/materials/analysis tools: LC JL KRF MYZ YDC. Wrote the
paper: LC JL KRF.
1. Dubach UC , Rosner B , Sturmer T ( 1991 ) An epidemiologic study of abuse of analgesic drugs . Effects of phenacetin and salicylate on mortality and cardiovascular morbidity (1968 to 1987 ). N Engl J Med 324 : 155 - 160 .
2. ''AstraZeneca Decides to Withdraw Exanta'' ( 2006 ) Available: http://www. astrazeneca.com/Media/Press-releases/ Article/20060214-AstraZenecaDecides-to-Withdraw-Exanta . Accessed 2012 Sep 2.
3. Wang WB , Zhao YP , Cong L , Jing H , Liao Q , et al. ( 2011 ) Clinical characters of gastrointestinal lesions in intestinal Behcet's disease . Chin Med Sci J 26 : 168 - 171 .
4. Zheng M , Liu Z , Xue C , Zhu W , Chen K , et al. ( 2006 ) Mutagenic probability estimation of chemical compounds by a novel molecular electrophilicity vector and support vector machine . Bioinformatics 22 : 2099 - 2106 .
5. Wang Y , Lu J , Wang F , Shen Q , Zheng M , et al. ( 2012 ) Estimation of carcinogenicity using molecular fragments tree . J Chem Inf Model 52 : 1994 - 2003 .
6. Crammer K , Singer Y ( 2001 ) On the algorithmic implementation of multiclass kernel-based vector machines . Journal of Machine Learning Research 2 : 265 - 292 .
7. Dietterich TG , Bakiri G ( 1995 ) Solving multiclass learning problems via errorcorrecting output codes . Journal of Artificial Intelligence Research 2 : 263 - 286 .
8. Sharan R , Ulitsky I , Shamir R ( 2007 ) Network-based prediction of protein function . Mol Syst Biol 3 : 88 .
9. Bogdanov P , Singh AK ( 2010 ) Molecular function prediction using neighborhood features . IEEE/ACM Trans Comput Biol Bioinform 7 : 208 - 217 .
10. Kourmpetis YA , van Dijk AD , Bink MC , van Ham RC , ter Braak CJ ( 2010 ) Bayesian Markov Random Field analysis for protein function prediction based on network data . PLoS One 5 : e9293 .
11. Ng KL , Ciou JS , Huang CH ( 2010 ) Prediction of protein functions based on function-function correlation relations . Comput Biol Med 40 : 300 - 305 .
12. Hu L , Huang T , Liu XJ , Cai YD ( 2011 ) Predicting protein phenotypes based on protein-protein interaction network . PLoS One 6 : e17668 .
13. Hu L , Huang T , Shi X , Lu WC , Cai YD , et al. ( 2011 ) Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties . PLoS One 6 : e14556 .
14. Gao P , Wang QP , Chen L , Huang T ( 2012 ) Prediction of Human Genes Regulatory Functions Based on Protein-protein Interaction Network . Protein and Peptide Letters 19 : 910 - 916 .
15. Hu LL , Chen C , Huang T , Cai YD , Chou KC ( 2011 ) Predicting Biological Functions of Compounds Based on Chemical-Chemical Interactions . PLoS ONE 6 : e29491 .
16. Chen L , Zeng WM , Cai YD , Feng KY , Chou KC ( 2012 ) Predicting Anatomical Therapeutic Chemical (ATC) Classification of Drugs by Integrating Chemical-Chemical Interactions and Similarities . PLoS ONE 7 : e35254 .
17. Kuhn M , von Mering C , Campillos M , Jensen LJ , Bork P ( 2008 ) STITCH: interaction networks of chemicals and proteins . Nucleic Acids Res 36 : D684 - 688 .
18. Kuhn M , Szklarczyk D , Franceschini A , Campillos M , von Mering C , et al. ( 2010 ) STITCH 2: an interaction network database for small molecules and proteins . Nucleic Acids Res 38 : D552 - 556 .
19. Accelrys Toxicity Database 2011.4. Accelrys Software Inc.: San Diego, CA.
20. DrugBank . Available: http://www.drugbank.ca/downloads. Accessed 2012 Sep 2.
21. HMDB. Available: http://www.hmdb.ca/downloads. Accessed 2012 Sep 2.
22. Du P , Li T , Wang X ( 2011 ) Recent progress in predicting protein sub-subcellular locations . Expert Review of Proteomics 8 : 391 - 404 .
23. Cai YD , Lu L , Chen L , He JF ( 2010 ) Predicting subcellular location of proteins using integrated-algorithm method . Molecular Diversity 14 : 551 - 558 .
24. Shao X , Tian Y , Wu L , Wang Y , Jing L , et al. ( 2009 ) Predicting DNA- and RNA-binding proteins from sequences with kernel methods . Journal of Theoretical Biology 258 : 289 - 293 .
25. Zeng Y , Guo Y , Xiao R , Yang L , Yu L , et al. ( 2009 ) Using the augmented Chou's pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach . Journal of Theoretical Biology 259 : 366 - 372 .
26. Chen L , Cai YD , Shi XH , Huang T ( 2012 ) Analysis of Metabolic Pathway Using Hybrid Properties . Protein and Peptide Letters 19 : 99 - 107 .
27. Esmaeili M , Mohabatkar H , Mohsenzadeh S ( 2010 ) Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses . Journal of Theoretical Biology 263 : 203 - 209 .
28. Georgiou D , Karakasidis T , Nieto J , Torres A ( 2009 ) Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition . Journal of Theoretical Biology 257 : 17 - 26 .
29. Li BQ , Hu LL , Chen L , Feng KY , Cai YD , et al. ( 2012 ) Prediction of Protein Domain with mRMR Feature Selection and Analysis . PLoS ONE 7 : e39308 .
30. Jin L , Fang W , Tang H ( 2003 ) Prediction of protein structural classes by a new measure of information discrepancy . Computational Biology and Chemistry 27 : 373 - 380 .
31. Ivanciuc O ( 2008 ) Weka machine learning for predicting the phospholipidosis inducing potential . Current Topics in Medicinal Chemistry 8 : 1691 - 1709 .
32. Ravetti MG , Moscato P ( 2008 ) Identification of a 5-protein biomarker molecular signature for predicting Alzheimer's disease . PLoS ONE 3 : e3111 .
33. Sun XD , Huang RB ( 2006 ) Prediction of protein structural classes using support vector machines . Amino Acids 30 : 469 - 475 .
34. Yuan JM , Koh WP , Murphy SE , Fan Y , Wang R , et al. ( 2009 ) Urinary levels of tobacco-specific nitrosamine metabolites in relation to lung cancer development in two prospective cohorts of cigarette smokers . Cancer Res 69 : 2990 - 2995 .
35. Kitiporn P , Jan-Phillip M , Marcus O , Fabian S , Victor S , et al. ( 2008 ) Machine learning based analyses on metabolic networks supports high-throughput knockout screens . BMC Systems Biology 2 : 67 .
36. Church TR , Anderson KE , Caporaso NE , Geisser MS , Le CT , et al. ( 2009 ) A prospectively measured serum biomarker for a tobacco-specific carcinogen and lung cancer in smokers . Cancer Epidemiol Biomarkers Prev 18 : 260 - 266 .
37. Benigni R , Bossa C ( 2011 ) Mechanisms of chemical carcinogenicity and mutagenicity: a review with implications for predictive toxicology . Chem Rev 111 : 2507 - 2536 .
38. Benigni R , Bossa C ( 2008 ) Structure alerts for carcinogenicity, and the Salmonella assay system: a novel insight through the chemical relational databases technology . Mutat Res 659 : 248 - 261 .
39. Arcos JC , Argus MF, editors ( 1995 ) Multifactor interaction network of carcinogenesis - a "tour guide" . Boston: Birkhauser. 1 - 20 p.
40. DrugBank . Available: http://www.drugbank.ca/drugs/DB01349. Accessed 2012 Sep 12.
41. Patlewicz GY , Wright ZM , Basketter DA , Pease CK , Lepoittevin JP , et al. ( 2002 ) Structure-activity relationships for selected fragrance allergens . Contact Dermatitis 47 : 219 - 226 .
42. Woo YT ( 2003 ) Mechanisms of action of chemical carcinogens, and their role in structure-activity relationships (SAR) analysis and risk assessment . In: Benigni R, editor. Quantitative Structure-Activity Relationship (QSAR) Models of Mutagens and Carcinogens . Boca Raton : CRC Press. 41 - 80 .