Logic models to predict continuous outputs based on binary inputs with an application to personalized cancer therapy

www.nature.com/scientificreports

OPEN

Received: 25 May 2016
Accepted: 27 September 2016
Published: 23 November 2016

Scientific Reports | 6:36812 | DOI: 10.1038/srep36812

Theo A. Knijnenburg¹, Gunnar W. Klau², Francesco Iorio³, Mathew J. Garnett⁴, Ultan McDermott⁴, Ilya Shmulevich¹ & Lodewyk F. A. Wessels⁵

¹ Institute for Systems Biology, Seattle, US.
² Centrum Wiskunde & Informatica, Amsterdam, The Netherlands.
³ European Molecular Biology Laboratory - European Bioinformatics Institute, UK.
⁴ Wellcome Trust Sanger Institute, UK.
⁵ Netherlands Cancer Institute, Amsterdam, and The Faculty of EEMCS, Delft University of Technology, Delft, The Netherlands.
Correspondence and requests for materials should be addressed to L.F.A.W. (email: )

Mining large datasets using machine learning approaches often leads to models that are hard to interpret and not amenable to the generation of hypotheses that can be experimentally tested. We present ‘Logic Optimization for Binary Input to Continuous Output’ (LOBICO), a computational approach that infers small and easily interpretable logic models of binary input features that explain a continuous output variable. Applying LOBICO to a large cancer cell line panel, we find that logic combinations of multiple mutations are more predictive of drug response than single gene predictors. Importantly, we show that the use of the continuous information leads to robust and more accurate logic models. LOBICO implements the ability to uncover logic models around predefined operating points in terms of sensitivity and specificity. As such, it represents an important step towards practical application of interpretable logic models.

Regression and classification models are important tools for researchers in various fields. The application of these many-to-one mapping models is two-fold. First, they can be used for prediction. The output value or class of a (new) case can be predicted by applying the inferred mapping to the input variables of the case. Second, they inform us about the relationship between the input and the output. They specify how the input variables are (mathematically) interacting with each other to produce the output variable. The usefulness of the second application is, however, limited by the power of the human intellect.

We suggest that the interpretation of these many-to-one mapping models is of utmost, yet undervalued, importance in many research fields. This also holds for computational biology, where a multitude of molecular and genomic data is frequently used to explain or predict a biological or clinical phenotype. Single predictor models are generally not accurate enough, reflecting the importance of acknowledging the interaction between biological components. On the other hand, machine learning approaches, such as Elastic Net [1] and Random Forests [2], produce complex multi-predictor models that are hard to interpret and not amenable to the generation of hypotheses that can be experimentally tested. As a consequence, such models are not likely to further our understanding of biology. There is an urgent need for approaches that build small, interpretable, yet accurate models that capture the interplay between biological components and explain the phenotype of interest.

In this study, we have developed such a modeling framework to explain drug response of cancer cell lines using gene mutation data. Our approach, ‘Logic Optimization for Binary Input to Continuous Output’ (LOBICO), infers small and easily interpretable logic models of gene mutations (binary input variables) that explain the observed sensitivity to anticancer drugs in the cell lines (continuous output). The contributions of our approach are three-fold: First, the continuous information of the output variable is retained in the logic mapping. The output variable is binarized, which facilitates its interpretation, yet the distances of the continuous values to the binarization threshold are used in the inference. Second, LOBICO provides the user with the option to include constraints on the model performance that allows the identification of logic models around operating points predefined in terms of sensitivity and specificity. This enables tailoring of the model to, for example, clinical applications where the severity of diseases or side effects of the treatment dictate a desired level of specificity or sensitivity. Third, the logic mapping is formulated as an integer linear programming problem (ILP). This means that advanced ILP solvers can be used to find an optimal logic mapping fast enough to apply LOBICO to large and complex datasets without the need to tune parameters.

Our work is similar in spirit to logic regression (LR) [3,4], sparse combinatorial inference (SCI) [5], Markov logic networks [6,7], combinatorial association logic (CAL) [8], CellNetOptimizer [9] and genetic programming for association studies (GPAS) [10], which all employ combinatorial logic to explicitly incorporate interactions in their models. The most important aspect in which LOBICO differentiates itself from these approaches is by its direct emphasis on interpretability. This is in contrast with the linearly weighted sums of logic functions as inferred by LR or the posterior probabilities of predictors in the model averaged across an ensemble of many solutions as inferred by SCI. Graphical models, such as Bayesian networks [11] and Markov random fields [12], also facilitate interpretation, although due to their probabilistic nature they do not lend themselves to standard formal reasoning as well as logic models do.
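The three contributions listed above can be illustrated with a small Python sketch. It is not the authors' implementation: it replaces the ILP formulation with plain exhaustive enumeration, which is only feasible for toy inputs. The function names, the toy data, the convention that a response below the threshold counts as sensitive, the use of the absolute distance to the threshold as the sample weight, and the restriction to formulas in disjunctive normal form with a bounded number of terms and literals are all assumptions made for illustration.

```python
from itertools import combinations, product

def binarize(y, threshold):
    """Label a sample 1 (sensitive) if its response is below the threshold.
    The weight is the distance to the threshold (an assumed weighting)."""
    labels = [1 if v < threshold else 0 for v in y]
    weights = [abs(v - threshold) for v in y]
    return labels, weights

def predict(formula, x):
    """Evaluate a DNF formula: an OR over terms, each term an AND over literals.
    A literal (i, v) holds when binary feature i equals v."""
    return int(any(all(x[i] == v for i, v in term) for term in formula))

def weighted_error(formula, X, labels, weights):
    """Sum the weights of all misclassified samples."""
    return sum(w for x, l, w in zip(X, labels, weights) if predict(formula, x) != l)

def sens_spec(formula, X, labels):
    """Return (sensitivity, specificity); assumes both classes are present."""
    preds = [predict(formula, x) for x in X]
    tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)
    tn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 0)
    pos = sum(labels)
    return tp / pos, tn / (len(labels) - pos)

def enumerate_formulas(n_features, max_terms, max_literals):
    """List all DNF formulas within the size limits, fewest literals first."""
    terms = [tuple(zip(idx, vals))
             for size in range(1, max_literals + 1)
             for idx in combinations(range(n_features), size)
             for vals in product((1, 0), repeat=size)]
    formulas = [f for k in range(1, max_terms + 1) for f in combinations(terms, k)]
    return sorted(formulas, key=lambda f: sum(len(t) for t in f))

def fit(X, y, threshold, max_terms=2, max_literals=2, min_sens=0.0, min_spec=0.0):
    """Exhaustive search (a stand-in for an ILP solver) for the formula with the
    lowest weighted error that meets the sensitivity/specificity constraints."""
    labels, weights = binarize(y, threshold)
    best, best_err = None, float("inf")
    for formula in enumerate_formulas(len(X[0]), max_terms, max_literals):
        sens, spec = sens_spec(formula, X, labels)
        if sens < min_sens or spec < min_spec:
            continue
        err = weighted_error(formula, X, labels, weights)
        if err < best_err:  # strict: ties keep the simpler, earlier formula
            best, best_err = formula, err
    return best, best_err

def show(formula, names):
    """Render a formula as readable text."""
    return " OR ".join(
        "(" + " AND ".join(names[i] if v else "NOT " + names[i] for i, v in term) + ")"
        for term in formula)

# Hypothetical toy data: mutation status of genes A, B, C in eight cell lines,
# and a continuous drug response (lower means more sensitive).
names = ["A", "B", "C"]
X = [(1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 0),
     (0, 0, 1), (0, 0, 1), (1, 0, 1), (0, 0, 0)]
y = [-2.0, -1.5, -3.0, 1.0, 2.0, 1.5, -1.0, -0.1]

model, err = fit(X, y, threshold=0.0)
strict, strict_err = fit(X, y, threshold=0.0, min_sens=1.0)
print(show(model, names), err)
print(show(strict, names), strict_err)
```

On this toy data the unconstrained search returns `(A) OR (B)` with a weighted error of 0.1: the only misclassified cell line lies very close to the threshold, so it costs little. Requiring a sensitivity of 1.0 shifts the operating point to `(A) OR (NOT C)`, which captures every sensitive line at the cost of one false positive and a weighted error of 1.0.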
MOCA (Multivariate Organization of Combinatorial Alterations) [13] deserves special attention as it has also been applied to predict drug response by inferring logic combinations of genomic input features. The most important differences with our work are: (1) MOCA employs a heuristic, sub-optimal progressive selection of features to infer logic formulas, and (2) MOCA uses discretized drug response values and discards the information in the continuous values that LOBICO uses in its model inference. Moreover, LOBICO includes constraints on statistical performance criteria, such as a minimum specificity, which is a novel feature not available in any other approach.

Here, we demonstrate LOBICO by application to a large cancer cell line panel, where the goal is to explain drug response based on binary mutation data of a set of genes [14]. We investigate whether logic models perform better than single-gene predictors, and put genes that co-occur in logic models in the context of known cancer pathways. We assess whether using continuous output values provides benefit in terms of robustness and performance above the use of binarized data, which is usually the starting point for logical analysis of data [15]. We also provide a comparison with Elastic Net, Random For (...truncated)