Logic models to predict continuous outputs based on binary inputs with an application to personalized cancer therapy
www.nature.com/scientificreports
OPEN
Received: 25 May 2016
Accepted: 27 September 2016
Published: 23 November 2016
Theo A. Knijnenburg1, Gunnar W. Klau2, Francesco Iorio3, Mathew J. Garnett4,
Ultan McDermott4, Ilya Shmulevich1 & Lodewyk F. A. Wessels5
Mining large datasets using machine learning approaches often leads to models that are hard to
interpret and not amenable to the generation of hypotheses that can be experimentally tested.
We present ‘Logic Optimization for Binary Input to Continuous Output’ (LOBICO), a computational
approach that infers small and easily interpretable logic models of binary input features that explain
a continuous output variable. Applying LOBICO to a large cancer cell line panel, we find that logic
combinations of multiple mutations are more predictive of drug response than single gene predictors.
Importantly, we show that the use of the continuous information leads to robust and more accurate
logic models. LOBICO implements the ability to uncover logic models around predefined operating
points in terms of sensitivity and specificity. As such, it represents an important step towards practical
application of interpretable logic models.
Regression and classification models are important tools for researchers in various fields. The application of
these many-to-one mapping models is two-fold. First, they can be used for prediction. The output value or class
of a (new) case can be predicted by applying the inferred mapping to the input variables of the case. Second,
they inform us about the relationship between the input and the output. They specify how the input variables
are (mathematically) interacting with each other to produce the output variable. The usefulness of the second
application is, however, limited by the power of the human intellect. We suggest that the interpretation of these
many-to-one mapping models is of utmost, yet undervalued, importance in many research fields.
This also holds for computational biology, where a multitude of molecular and genomic data is frequently used
to explain or predict a biological or clinical phenotype. Single predictor models are generally not accurate enough,
reflecting the importance of acknowledging the interaction between biological components. On the other hand,
machine learning approaches, such as Elastic Net1 and Random Forests2, produce complex multi-predictor models that are hard to interpret and not amenable to the generation of hypotheses that can be experimentally tested.
As a consequence, such models are not likely to further our understanding of biology. There is an urgent need
for approaches that build small, interpretable, yet accurate models that capture the interplay between biological
components and explain the phenotype of interest.
In this study, we have developed such a modeling framework to explain drug response of cancer cell lines
using gene mutation data. Our approach, ‘Logic Optimization for Binary Input to Continuous Output’ (LOBICO),
infers small and easily interpretable logic models of gene mutations (binary input variables) that explain the
observed sensitivity to anticancer drugs in the cell lines (continuous output).
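To make the notion of a logic model over binary mutation data concrete, the following minimal sketch evaluates a small formula on a few mutation profiles. It assumes the formula is written in disjunctive normal form (an OR of AND-terms), which is one common way to express such models. The gene names, the formula, and the helper `evaluate_dnf` are hypothetical illustrations and are not taken from the LOBICO implementation.

```python
from typing import Dict, List, Tuple

# A literal is (gene, required_value): ("GENE_A", 1) reads "GENE_A is mutated",
# ("GENE_C", 0) reads "GENE_C is not mutated". Names are hypothetical.
Literal = Tuple[str, int]
# A DNF formula is an OR over terms; each term is an AND over literals.
DNF = List[List[Literal]]


def evaluate_dnf(formula: DNF, sample: Dict[str, int]) -> int:
    """Return 1 if at least one AND-term is fully satisfied by the sample, else 0."""
    return int(any(all(sample[gene] == value for gene, value in term)
                   for term in formula))


# Hypothetical model: (GENE_A AND NOT GENE_C) OR GENE_B
model: DNF = [[("GENE_A", 1), ("GENE_C", 0)], [("GENE_B", 1)]]

# Hypothetical binary mutation profiles of three cell lines.
cell_lines = [
    {"GENE_A": 1, "GENE_B": 0, "GENE_C": 0},  # first term holds
    {"GENE_A": 1, "GENE_B": 0, "GENE_C": 1},  # neither term holds
    {"GENE_A": 0, "GENE_B": 1, "GENE_C": 1},  # second term holds
]

predictions = [evaluate_dnf(model, sample) for sample in cell_lines]
print(predictions)
```

A formula of this size can be read directly as a testable hypothesis, which is the interpretability argument made above.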
1Institute for Systems Biology, Seattle, US. 2Centrum Wiskunde & Informatica, Amsterdam, The Netherlands.
3European Molecular Biology Laboratory - European Bioinformatics Institute, UK. 4Wellcome Trust Sanger Institute,
UK. 5Netherlands Cancer Institute, Amsterdam, and The Faculty of EEMCS, Delft University of Technology, Delft, The
Netherlands. Correspondence and requests for materials should be addressed to L.F.A.W. (email: )
Scientific Reports | 6:36812 | DOI: 10.1038/srep36812
The contributions of our approach are three-fold: First, the continuous information of the output variable is
retained in the logic mapping. The output variable is binarized, which facilitates its interpretation, yet the distances of the continuous values to the binarization threshold are used in the inference. Second, LOBICO provides
the user with the option to include constraints on the model performance that allows the identification of logic
models around operating points predefined in terms of sensitivity and specificity. This enables tailoring of the
model to, for example, clinical applications where the severity of diseases or side effects of the treatment dictate a
desired level of specificity or sensitivity. Third, the logic mapping is formulated as an integer linear programming
problem (ILP). This means that advanced ILP solvers can be used to find an optimal logic mapping fast enough to
apply LOBICO to large and complex datasets without the need to tune parameters.
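The first two contributions can be illustrated with a small sketch. It makes several assumptions that are not stated in the text: a sample counts as sensitive when its continuous value lies below the threshold, each sample is weighted by its raw absolute distance to the threshold, and the search is a plain exhaustive enumeration of one- and two-feature models rather than the ILP formulation that LOBICO uses. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def weighted_error(pred, labels, weights):
    """Sum the weights of all misclassified samples."""
    return sum(w for p, l, w in zip(pred, labels, weights) if p != l)


def specificity(pred, labels):
    """Fraction of true negatives among all negative samples."""
    negatives = [p for p, l in zip(pred, labels) if l == 0]
    if not negatives:
        return 1.0
    return sum(1 for p in negatives if p == 0) / len(negatives)


def candidate_models(n_features):
    """Yield (name, function) for single features and pairwise AND/OR models."""
    for i in range(n_features):
        yield f"x{i}", (lambda x, i=i: x[i])
    for i, j in combinations(range(n_features), 2):
        yield f"x{i} AND x{j}", (lambda x, i=i, j=j: x[i] & x[j])
        yield f"x{i} OR x{j}", (lambda x, i=i, j=j: x[i] | x[j])


def fit(X, y, threshold, min_specificity=0.0):
    """Exhaustive search (a stand-in for ILP) for the lowest weighted error."""
    labels = [int(v < threshold) for v in y]    # assumption: low value = sensitive
    weights = [abs(v - threshold) for v in y]   # assumption: raw distance as weight
    best = None
    for name, model in candidate_models(len(X[0])):
        pred = [model(x) for x in X]
        if specificity(pred, labels) < min_specificity:
            continue  # model violates the performance constraint
        err = weighted_error(pred, labels, weights)
        if best is None or err < best[1]:
            best = (name, err)
    return best


# Toy data: two binary features, six samples, threshold at 0.
X = [(1, 1), (1, 0), (1, 0), (0, 0), (1, 0), (0, 0)]
y = [-3.0, -2.0, -1.0, 1.0, 0.1, 2.0]

unconstrained = fit(X, y, threshold=0.0)
constrained = fit(X, y, threshold=0.0, min_specificity=1.0)
print(unconstrained, constrained)
```

In this toy data the unconstrained search accepts a model whose only error is a sample lying just above the threshold, which costs little under distance weighting. Requiring perfect specificity excludes that model and selects a more conservative one at a higher weighted error.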
Our work is similar in spirit to logic regression (LR)3,4, sparse combinatorial inference (SCI)5, Markov logic
networks6,7, combinatorial association logic (CAL)8, CellNetOptimizer9 and genetic programming for association
studies (GPAS)10, which all employ combinatorial logic to explicitly incorporate interactions in their models. The
most important aspect in which LOBICO differentiates itself from these approaches is by its direct emphasis on
interpretability. This is in contrast with the linearly weighted sums of logic functions as inferred by LR or the posterior probabilities of predictors in the model averaged across an ensemble of many solutions as inferred by SCI.
Graphical models, such as Bayesian networks11 and Markov random fields12, also facilitate interpretation, although
due to their probabilistic nature they do not lend themselves to standard formal reasoning as well as logic models
do. MOCA (Multivariate Organization of Combinatorial Alterations)13 deserves special attention as it has also
been applied to predict drug response by inferring logic combinations of genomic input features. The most important differences from our work are: (1) MOCA employs a heuristic, sub-optimal progressive selection of features
to infer logic formulas, and (2) MOCA uses discretized drug response values and discards the information in the
continuous values that LOBICO uses in its model inference. Moreover, LOBICO includes constraints on statistical
performance criteria, such as a minimum specificity, which is a novel feature not available in any other approach.
Here, we demonstrate LOBICO by application to a large cancer cell line panel, where the goal is to explain drug
response based on binary mutation data of a set of genes14. We investigate whether logic models perform better
than single-gene predictors, and put genes that co-occur in logic models in the context of known cancer pathways. We assess whether using continuous output values provides a benefit in terms of robustness and performance
over the use of binarized data, which is usually the starting point for logical analysis of data15. We also provide
a comparison with Elastic Net, Random For (...truncated)