A Decision-Theoretic Approach to Model Choice

Annals of Data Science
https://doi.org/10.1007/s40745-025-00589-w

ORIGINAL ARTICLE

Markku Karhunen

Received: 21 January 2024 / Revised: 2 January 2025 / Accepted: 17 January 2025
© The Author(s) 2025

Abstract

Model choice algorithms are usually compared based on their accuracy, i.e. ability to find true models. However, conservative algorithms (such as BIC minimisation) are accurate when no true effects exist, while more liberal algorithms (such as Lasso) are accurate when there are plenty of true effects. There is ambiguity, then, regarding the correct algorithm. The purpose of this paper is to show how expected utility maximisation and Monte Carlo simulations can be used to compare model choice algorithms. Two loss functions are derived from the expected utility function of the researcher. Both loss functions turn out to be linear combinations of specificity and one or two kinds of sensitivity which are discussed in this paper. Subsequently, this paper experiments with four parametrisations of these loss functions, and then uses these parametrised versions to compare nine algorithms within the contexts of both logistic and Gaussian regression. The results demonstrate that researchers who avoid false positives should either use BIC or BICc for model choice or report nothing at all. AIC does not seem to be the optimal method for the range of parameters covered in this study.

Keywords: Model choice · Gaussian · Logistic · Expected utility · Loss function

1 Introduction

Model choice can have a number of meanings in statistics. For example, model choice could mean a choice between multiplicative and additive models. However, the term usually refers to the choice of covariates within a regression model. The information criteria established by Akaike [1] and Schwarz [2] are the classic tools in this domain.
Another prominent method is the Lasso [3], which includes all covariates within the model and attempts to force some coefficients to zero by using a penalty function, thereby producing a parsimonious model. Both AIC and BIC have been specifically adapted for small sample sizes [4, 5], yielding formulas that asymptotically converge towards the original versions of these statistics. On the other hand, the basic principle of the Lasso has been applied to various other problems, such as precision matrix estimation [6], multi-response regression [7], and multilevel medical data [8]. However, the focus of this paper is the simple regression problem: how to choose the right covariates for a scalar response variable.

Author affiliation: Markku Karhunen, Built Environment Solutions Unit, Finnish Environment Institute (Syke), Jyväskylän Toimipaikka, Survontie 9A, 40500 Jyväskylä, Finland

Karhunen recently compared the performance of nine model choice methods within the context of logistic regression [9]. This comparison was based on a loss function that was defined as a linear combination of sensitivity and specificity. While intuitively appealing, this loss function was not justified by any derivation or proof. Here, however, this loss function and its generalisation are derived from a utility-maximisation problem. This method is also applied to linear model choice tasks.

The theoretical framework of this paper is expected utility maximisation, also known as Von Neumann-Morgenstern utility [10]. To summarise, the idea is that a rational agent should account for all possible world states and weigh them according to their probability. Utility maximisation is a widely accepted paradigm in microeconomics and decision theory, and the term ‘expected utility’ is used when there is uncertainty regarding the potential outcome of any or all actions.
Expected utility maximisation has been applied to problems as diverse as traffic behaviour [11], strategic management [12], oncology [13], and strategic deterrence [14]. Perhaps the closest point of comparison to the present study is a framework where expected utility maximisation was used to determine the optimal threshold of a clinical test [15]. With this kind of application, there is a trade-off between sensitivity and specificity. In clinical testing, sensitivity means the probability of correctly labelling affected individuals, while specificity means the probability of correctly labelling unaffected individuals. In model choice, sensitivity means the power to detect true covariates, and specificity means the power to avoid false covariates in the model equation. There is a trade-off between sensitivity and specificity in this domain as well [9]. Intuitively, the choice of method depends on whether the researchers require results (thus preferring high sensitivity), or whether they want to avoid false results (thus preferring high specificity).

The innovation of this paper is to formalise this trade-off in terms of expected utility maximisation. In the next section, the two loss functions are derived, followed by the introduction of two practical applications for logistic and Gaussian regression models. The results are presented in Sect. 3, followed by analysis and conclusions in Sect. 4.

2 Material and Methods

Let us assume that there is a true effect in the data with probability π, and that it can be detected with probability p. Let us also assume that noise effects are incorrectly included in the model with probability q. One may calculate the probability of detecting the true effect and nothing but the true effect, i.e. sensitivity-1 (Sens1), the probability of detecting the true effect, i.e. sensitivity-2 (Sens2), and the probability of avoiding false covariates, i.e. specificity (Spec) [9].
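As an illustration of how Sens1, Sens2 and Spec might be estimated by simulation, the following minimal Python sketch draws detection outcomes directly from p and q instead of fitting any regression model. It rests on assumptions that are not taken from the paper: the function name `estimate_rates`, the values p = 0.8 and q = 0.1, and the independence of detecting the true effect and including noise covariates are all hypothetical choices made for the example. Under that independence assumption, the estimates should lie close to p(1 − q), p and 1 − q, respectively.

```python
import random


def estimate_rates(p, q, n_sims=100_000, seed=0):
    """Estimate Sens1, Sens2 and Spec from simulated detection outcomes.

    Assumption (illustrative only): detection of the true effect and
    inclusion of noise covariates are independent Bernoulli events.
    """
    rng = random.Random(seed)
    detected = 0        # runs in which the true effect is detected
    clean = 0           # runs in which no noise covariate is included
    detected_only = 0   # runs with the true effect and nothing else
    for _ in range(n_sims):
        hit = rng.random() < p      # true effect detected in this run
        noise = rng.random() < q    # noise covariate included in this run
        detected += hit
        clean += not noise
        detected_only += hit and not noise
    return {
        "Sens1": detected_only / n_sims,
        "Sens2": detected / n_sims,
        "Spec": clean / n_sims,
    }


# Hypothetical parameter values, chosen only for illustration.
rates = estimate_rates(p=0.8, q=0.1)
print(rates)
```

In an actual comparison of algorithms, the two Bernoulli draws would be replaced by fitting each model choice algorithm to simulated data and recording which covariates it selects.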
From these definitions, it follows that these quantities are:

$$\mathrm{Sens1} = p(1 - q), \tag{1}$$

$$\mathrm{Sens2} = p, \tag{2}$$

$$\mathrm{Spec} = 1 - q. \tag{3}$$

For any data-generating process and model choice algorithm, these quantities may be estimated from Monte Carlo simulations, but the researcher needs a utility function to rank the different algorithms. Below, two different utility functions are introduced, yielding loss functions 1 and 2, which are linear functions of Sens1, Sens2 and Spec.

2.1 Loss Function 1

Let us assume that the payoff for the researchers does not depend on the noise covariates if they detect a true effect, but that they try to avoid reporting anything if no true covariate exists. The expected utility of this type of researcher is given by

$$U = \pi p u_1 + \pi (1 - p) u_2 + (1 - \pi) q u_3 + (1 - \pi)(1 - q) u_4, \tag{4}$$

where $u_1$ is the payoff in the case that they correctly detect a true effect, $u_2$ is the payoff if they fail to detect a true effect, $u_3$ is the payoff if they detect a false effect, and $u_4$ is the payoff if they do not detect anything and there is no effect in the data. It can be assumed that $u_1 > u_2$ and $u_4 > u_3$. From Eq. (4), it follows that

$$U = \pi p u_1 - \pi p u_2 + (1 - \pi) q u_3 - (1 - \pi) q u_4 + \pi u_2 + (1 - \pi) u_4. \tag{5}$$

Above, only p (...truncated)