From Kernel Methods to Neural Networks: A Unifying Variational Formulation

Foundations of Computational Mathematics
https://doi.org/10.1007/s10208-023-09624-9

Michael Unser

Received: 29 June 2022 / Revised: 14 February 2023 / Accepted: 20 July 2023
© The Author(s) 2023

Abstract

The minimization of a data-fidelity term and an additive regularization functional gives rise to a powerful framework for supervised learning. In this paper, we present a unifying regularization functional that depends on an operator L and on a generic Radon-domain norm. We establish the existence of a minimizer and give the parametric form of the solution(s) under very mild assumptions. When the norm is Hilbertian, the proposed formulation yields a solution that involves radial-basis functions and is compatible with the classical methods of machine learning. By contrast, for the total-variation norm, the solution takes the form of a two-layer neural network with an activation function that is determined by the regularization operator. In particular, we retrieve the popular ReLU networks by letting the operator be the Laplacian. We also characterize the solution for the intermediate regularization norms $\|\cdot\| = \|\cdot\|_{L_p}$ with $p \in (1, 2]$. Our framework offers guarantees of universal approximation for a broad family of regularization operators or, equivalently, for a wide variety of shallow neural networks, including the cases (such as ReLU) where the activation function is increasing polynomially. It also explains the favorable role of bias and skip connections in neural architectures.

Keywords Machine learning · Convex optimization · Regularization · Representer theorem · Kernel methods · Neural networks · Banach space

Mathematics Subject Classification 44A12 · 46N10 · 47A52 · 68T07

Communicated by Carola-Bibiane Schönlieb.

The research leading to these results has received funding from the European Research Council under Grant ERC-2020-AdG FunLearn-101020573.
Michael Unser, Biomedical Imaging Group, École polytechnique fédérale de Lausanne (EPFL), Station 17, 1015 Lausanne, Switzerland

1 Introduction

Regularization theory constitutes a powerful framework for the derivation of algorithms for supervised learning [14, 41, 42]. Given a series of data points $(x_m, y_m) \in \mathbb{R}^d \times \mathbb{R}$, $m = 1, \dots, M$, the basic problem (regression) is to find a mapping $f : \mathbb{R}^d \to \mathbb{R}$ such that $f(x_m) \approx y_m$, without overfitting. The standard paradigm is to let $f$ be the minimizer of a cost that consists of a data-fidelity term and an additive regularization functional [8]. The minimization proceeds over a prescribed class $H$ of candidate functions. One usually distinguishes between the parametric approaches (e.g., neural networks), where $H = H_\Theta$ is a family of functions specified by a finite set of parameters $\theta \in \Theta$ (e.g., the weights of the network), and the nonparametric ones, where the properties of the solution are controlled by the regularization functional.

The focus of this paper is on the nonparametric techniques. They rely on functional optimization, which means that the minimization proceeds over a space of functions rather than over a set of parameters. The regularization is usually chosen to be an increasing function of the norm associated with a particular Banach space, which results in a well-posed problem [9, 10, 56].

The functional-optimization point of view is often constructive, in that it suggests or supports explicit learning architectures. For instance, the choice of the Hilbertian regularization $R(f) = \|f\|_H^2$, where $H$ is a reproducing-kernel Hilbert space (RKHS), results in a closed-form solution that is a linear combination of kernels positioned on the data [7, 62]. In fact, the RKHS setting yields a generic class of estimators that is compatible with the classical kernel-based methods of machine learning, including support vector machines [1, 41, 49, 50, 56, 62].
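The closed-form RKHS solution mentioned above can be illustrated with a minimal sketch. It assumes a quadratic data-fidelity term and a Gaussian kernel, neither of which is prescribed by the text, so that the estimator reads $f(x) = \sum_m a_m k(x, x_m)$ with coefficients solving $(K + \lambda I)a = y$. The function names (`gaussian_kernel`, `fit_kernel_ridge`, `predict`), the toy data, and the values of `lam` and `sigma` are hypothetical choices made for illustration only.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear_system(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so that the inputs are left untouched.
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    # Back substitution on the upper-triangular system.
    c = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (M[i][n] - tail) / M[i][i]
    return c


def fit_kernel_ridge(X, y, lam=1e-3, sigma=1.0):
    """Return the coefficients a that solve (K + lam * I) a = y."""
    n = len(X)
    K = [[gaussian_kernel(X[i], X[j], sigma) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += lam
    return solve_linear_system(K, y)


def predict(X_train, coeffs, x, sigma=1.0):
    """Evaluate f(x) as a linear combination of kernels placed on the data."""
    return sum(a * gaussian_kernel(xm, x, sigma) for a, xm in zip(coeffs, X_train))


# Toy one-dimensional data (assumed for illustration).
X_train = [(0.0,), (1.0,), (2.0,), (3.0,)]
y_train = [0.0, 1.0, 0.0, -1.0]
coeffs = fit_kernel_ridge(X_train, y_train, lam=1e-6, sigma=0.7)
fitted = [predict(X_train, coeffs, x, sigma=0.7) for x in X_train]
```

With a very small `lam`, the fitted values nearly interpolate the data; larger values trade data fidelity for a smaller RKHS norm.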
Likewise, adaptive kernel methods are justifiable from the minimization of a generalized total-variation norm, which favors sparse representations [3, 11, 12]. These latter results actually take their root in spline theory [18, 28, 60]. Similarly, it has been demonstrated that shallow ReLU networks are solutions of functional-optimization problems with an appropriate regularization. One way to achieve this is to start from an explicit parameterization of an infinite-width network [4] (the reverse engineering/synthesis approach). Another way is to consider a regularization operator that is matched to the neuronal activation with an $L_1$-type penalty¹; for instance, a second-order derivative for $d = 1$ [36, 48] or, more generally, the Radon-domain counterpart of the Laplace operator whose Green's function is precisely a ReLU ridge [35, 37, 57]. Similar optimality results can be stated within the framework of reproducing-kernel Banach spaces [6], which is a formal point of view that bridges the synthesis and analysis approaches of [4] and [37], respectively. Also relevant to the discussion is a variational formulation that links the ridgelet transform to the training of shallow neural networks with weight-decay regularization [53].

¹ The precise formulation involves the $\mathcal{M}$-norm (or total variation), which is the weak form of $L_1$ associated with the space of bounded Radon measures. In our account, we take it as the default norm for the Lebesgue exponent $p = 1$, with a slight abuse of language.

The second important benefit of the functional-optimization approach is that it gives insight into the approximation capabilities (expressivity) of the resulting learning architectures. This information is encapsulated in the definition of the native space H (typically, a Sobolev space), which goes hand-in-hand with the regularization functional. Roughly speaking, the native space H ought to be "large enough" to allow for the approximation of any continuous function with an arbitrary degree of precision. This universal approximation property is a central theme in the theory of radial-basis functions (RBFs) [31, 63]. In machine learning, the kernel estimators that meet this approximation requirement are called universal [32]. When the basis functions are shifted replicates of a single template $h : \mathbb{R}^d \to \mathbb{R}$, then the condition is equivalent to $h$ being strictly positive definite, which means that its Fourier transform is real-valued, symmetric, and (strictly) positive [13]. Similar guarantees of universal approximation exist for (shallow) neural networks under mild conditions on the activation functions [5, 16, 25, 30, 39]. The main difference with the RKHS framework, however, is that the universality results for neural nets usually make the assumption that the input domain is a compact subset of $\mathbb{R}^d$.

The purpose of this paper is to unify and extend these various approaches by introducing a universal regularization functional. The latter has two components: an (...truncated)