Performance of objective functions and optimisation procedures for parameter estimation in system biology models
Abstract
Mathematical modelling of signalling pathways aids experimental investigation in system and synthetic biology. Ever increasing data availability prompts the development of large dynamic models with numerous parameters. In this paper, we investigate how the number of unknown parameters affects the convergence of three frequently used optimisation algorithms and four objective functions. We compare objective functions that use data-driven normalisation of the simulations with those that use scaling factors. The data-driven normalisation of the simulation approach implies that simulations are normalised in the same way as the data, making both directly comparable. The scaling factor approach, which is commonly used for parameter estimation in dynamic systems, introduces scaling factors that multiply the simulations to convert them to the scale of the data. Here we show that the scaling factor approach increases, compared to data-driven normalisation of the simulations, the degree of practical non-identifiability, defined as the number of directions in the parameter space, along which parameters are not identifiable. Further, the results indicate that data-driven normalisation of the simulations greatly improve the speed of convergence of all tested algorithms when the overall number of unknown parameters is relatively large (74 parameters in our test problems). Data-driven normalisation of the simulations also markedly improve the performance of the non-gradient-based algorithm tested even when the number of unknown parameters is relatively small (10 parameters in our test problems). As the models and the unknown parameters increase in size, the data-driven normalisation of the simulation approach can be the preferred option, because it does not aggravate non-identifiability and allows for obtaining parameter estimates in a reasonable amount of time.
Introduction
Signalling pathways constitute the machinery that cells use to sense, process and transmit information within and between cells. Because of the non-linear nature and the complexity of signalling pathways, mathematical modelling has been used to formalise the current understanding, identify inconsistencies and suggest new hypotheses. Dynamic models, such as ordinary differential equations (ODEs), are among widely used modelling approaches, capturing the quantitative and dynamic nature of cellular signalling pathways.1, 2 Mathematically, ODE models are of the form
$$\frac{{\rm{d}}}{{{\rm{d}}t}}x = f\left( {x,\theta } \right),$$
(1)
where \(f\left( \cdot \right)\) is a nonlinear function of state vector x. The ODE describes how the rate of change dx/dt depends on x and kinetic parameters θ. An ODE solution is a function (x = x(t,θ)) that depends on time and parameters. Successful examples of dynamic modelling include elucidating the decision-making mechanisms in growth factor signalling,3, 4 stress and DNA damage response5, 6 and cell migration.7
With ever-increasing pace of documenting molecular interactions within and between signalling pathways, there is a need to develop larger and more complex mathematical models. While our current knowledge of molecular interactions allows us to derive kinetic equations of a model in a relatively straightforward manner, the associated increase in unknown kinetic parameters presents a challenge. Usually, these parameters are not directly experimentally accessible. Parameter estimation is the process of indirectly estimating the unknown parameter values using measurement data, which requires high-resolution data of time-courses and multiple perturbations.4, 8, 9 The complexity and non-linearity of biological systems render the parameter estimation problem mathematically difficult. Issues arise from both the existence of local minima and non-identifiability. To overcome local minima, heuristic optimisation-based algorithms are used.10, 11 Non-identifiability means that a unique solution to the parameter estimation problem does not exist. Thus, there are many sets of parameter values that fit the data equally well. Non-identifiability can only be overcome by model reformulation or model reduction,12 or by generating additional data, for example by measuring additional variables.13
Usually only a subset of all the internal states (or a function thereof) are measured, referred to as observables \(\tilde y\).9, 14 To describe the relation between the states and observables in the model, a so-called output function is used y = g(x). Based on the simulated (y) and measured (\(\tilde y\)) observables, a parameter estimation problem can be formulated as an optimisation problem, in which the error between measured and simulated values is minimised.9 Mathematically, this error is described using an objective function (sometimes also called the goodness-of-fit function). Several choices for such objective functions exist. Most common are least-squares (LS), chi-square and log-likelihood (LL).15
A multitude of optimisation algorithms exists that estimate parameters of dynamic models.16 A recent comparison found that LSQNONLIN SE (a local gradient-based search algorithm with Latin hypercube restarts) performs best in terms of both accuracy and speed (as measured in the number of function evaluations required to estimate the parameters).17 In particular, this algorithm largely outperformed 14 other algorithms used to solve non-linear optimisation problems, including stochastic algorithms from an Evolutionary Algorithms framework (EvA2)18 and a hybrid stochastic-deterministic algorithm based on scatter search.19 Thus, LSQNONLIN SE has become a popular choice for parameter estimation of systems biology models.20,21,22,23
It is reported that hybrid stochastic-deterministic methods perform better than local gradient-based methods with restarts for complex problems.24, 25 Hybrid methods combine stochastic strategies, which help to escape local minima, with deterministic local strategies, which quickly find local optimal solutions.24, 25 For instance, previously, we used GLSDC,26 a hybrid algorithm that was not included in the above comparison,17 that time-efficiently estimated parameters for a complex model of expression of the transcription factor cFOS.3 However, it is unknown how GLSDC compares to other algorithms, in particular LSQNONLIN SE, which presumably is the currently fastest choice according to ref. 17.
A practical problem arising for any parameter estimation is how to scale the simulated data points to the measured data. This is because the most common type of experimental data are relative data (e.g. western blotting,27, 28 multiplexed Elisa,29 proteomics or RT-qPCR30), which means that the values of the data points are in arbitrary units (au), such as optical densities.31 In contrast, mathematical models carry well-defined units, such as molar concentrations or normalised dimensionless variables.9, 17, 24 For example, quantification of a western blot (...truncated)