From Kernel Methods to Neural Networks: A Unifying Variational Formulation
Foundations of Computational Mathematics
https://doi.org/10.1007/s10208-023-09624-9
Michael Unser¹
Received: 29 June 2022 / Revised: 14 February 2023 / Accepted: 20 July 2023
© The Author(s) 2023
Abstract
The minimization of a data-fidelity term and an additive regularization functional
gives rise to a powerful framework for supervised learning. In this paper, we present
a unifying regularization functional that depends on an operator L and on a generic
Radon-domain norm. We establish the existence of a minimizer and give the parametric
form of the solution(s) under very mild assumptions. When the norm is Hilbertian,
the proposed formulation yields a solution that involves radial-basis functions and
is compatible with the classical methods of machine learning. By contrast, for the
total-variation norm, the solution takes the form of a two-layer neural network with
an activation function that is determined by the regularization operator. In particular,
we retrieve the popular ReLU networks by letting the operator be the Laplacian. We
also characterize the solution for the intermediate regularization norms $\|\cdot\| = \|\cdot\|_{L_p}$
with p ∈ (1, 2]. Our framework offers guarantees of universal approximation for a
broad family of regularization operators or, equivalently, for a wide variety of shallow
neural networks, including the cases (such as ReLU) where the activation function is
increasing polynomially. It also explains the favorable role of bias and skip connections
in neural architectures.
Keywords Machine learning · Convex optimization · Regularization · Representer
theorem · Kernel methods · Neural networks · Banach space
Mathematics Subject Classification 44A12 · 46N10 · 47A52 · 68T07
Communicated by Carola-Bibiane Schönlieb.
The research leading to these results has received funding from the European Research Council under
Grant ERC-2020-AdG FunLearn-101020573.
Michael Unser
¹ Biomedical Imaging Group, École polytechnique fédérale de Lausanne (EPFL), Station 17,
1015 Lausanne, Switzerland
1 Introduction
Regularization theory constitutes a powerful framework for the derivation of algorithms for supervised learning [14, 41, 42]. Given a series of data points $(\boldsymbol{x}_m, y_m) \in \mathbb{R}^d \times \mathbb{R}$, $m = 1, \dots, M$, the basic problem (regression) is to find a mapping
$f : \mathbb{R}^d \to \mathbb{R}$ such that $f(\boldsymbol{x}_m) \approx y_m$, without overfitting. The standard paradigm
is to let f be the minimizer of a cost that consists of a data-fidelity term and an
additive regularization functional [8]. The minimization proceeds over a prescribed
class H of candidate functions. One usually distinguishes between the parametric
approaches (e.g., neural networks), where $H = H_{\Theta}$ is a family of functions specified
by a finite set of parameters θ ∈ Θ (e.g., the weights of the network), and the nonparametric ones, where the properties of the solution are controlled by the regularization
functional. The focus of this paper is on the nonparametric techniques. They rely on
functional optimization, which means that the minimization proceeds over a space of
functions rather than over a set of parameters. The regularization is usually chosen
to be an increasing function of the norm associated with a particular Banach space,
which results in a well-posed problem [9, 10, 56].
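In symbols, the generic problem described in this paragraph can be sketched as follows. The notation is an assumption introduced here for illustration only: $E$ stands for a data-fidelity loss, $\lambda > 0$ for a regularization weight, and $\psi$ for an increasing function of the norm of the search space $H$.

```latex
\min_{f \in H} \left( \sum_{m=1}^{M} E\bigl(y_m, f(\boldsymbol{x}_m)\bigr)
  + \lambda\, \psi\bigl(\|f\|_{H}\bigr) \right)
```

With this reading, the quadratic loss $E(y, z) = (y - z)^2$ and $\psi(t) = t^2$ would give the classical regularized least-squares problem.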
The functional-optimization point of view is often constructive, in that it suggests
or supports explicit learning architectures. For instance, the choice of the Hilbertian
regularization $R(f) = \|f\|_{H}^{2}$ where $H$ is a reproducing-kernel Hilbert space (RKHS)
results in a closed-form solution that is a linear combination of kernels positioned on
the data [7, 62]. In fact, the RKHS setting yields a generic class of estimators that
is compatible with the classical kernel-based methods of machine learning, including
support vector machines [1, 41, 49, 50, 56, 62]. Likewise, adaptive kernel methods are
justifiable from the minimization of a generalized total-variation norm, which favors
sparse representations [3, 11, 12]. These latter results actually have their roots in spline
theory [18, 28, 60]. Similarly, it has been demonstrated that shallow ReLU networks
are solutions of functional-optimization problems with an appropriate regularization.
One way to achieve this is to start from an explicit parameterization of an infinite-width
network [4] (the reverse engineering/synthesis approach). Another way is to consider
a regularization operator that is matched to the neuronal activation with an $L_1$-type
penalty¹; for instance, a second-order derivative for $d = 1$ [36, 48] or, more generally,
the Radon-domain counterpart of the Laplace operator whose Green’s function is
precisely a ReLU ridge [35, 37, 57]. Similar optimality results can be stated within the
framework of reproducing-kernel Banach spaces [6], which is a formal point of view
that bridges the synthesis and analysis approaches of [4] and [37], respectively. Also
relevant to the discussion is a variational formulation that links the ridgelet transform
to the training of shallow neural networks with weight-decay regularization [53].
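As a concrete illustration of the Hilbertian case mentioned above, the following minimal sketch fits a regularized least-squares estimator whose solution is a linear combination of kernels positioned on the data. It rests on assumptions that are not taken from the text: a quadratic data-fidelity term, a Gaussian kernel of width `sigma`, synthetic data, and the hypothetical helper names `gaussian_kernel`, `fit_coefficients`, and `predict`.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    """Gram matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by round-off before exponentiating.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def fit_coefficients(X, y, lam, sigma=0.5):
    """Solve (K + lam * I) a = y, the optimality condition of
    sum_m (y_m - f(x_m))^2 + lam * ||f||_H^2 for f = sum_m a_m k(., x_m)."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, coeffs, X_new, sigma=0.5):
    """Evaluate f(x) = sum_m a_m k(x, x_m) at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, sigma) @ coeffs

# Synthetic data (assumption): M = 20 points in dimension d = 2.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1]

lam = 1e-3
coeffs = fit_coefficients(X, y, lam)
X_new = rng.uniform(-1.0, 1.0, size=(5, 2))
y_new = predict(X, coeffs, X_new)
```

The estimator has exactly one coefficient per data point, which is the closed-form structure referred to in the text. The kernel and its width are design choices of this sketch, not prescriptions of the paper.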
¹ The precise formulation involves the M-norm (or total variation), which is the weak form of $L_1$ associated with the space of bounded Radon measures. In our account, we take it as the default norm for the Lebesgue exponent $p = 1$, with a slight abuse of language.
The second important benefit of the functional-optimization approach is that it
gives insight into the approximation capabilities (expressivity) of the resulting learning
architectures. This information is encapsulated in the definition of the native space H
(typically, a Sobolev space), which goes hand-in-hand with the regularization functional. Roughly speaking, the native space H ought to be “large enough” to allow for
the approximation of any continuous function with an arbitrary degree of precision.
This universal approximation property is a central theme in the theory of radial-basis
functions (RBFs) [31, 63]. In machine learning, the kernel estimators that meet this
approximation requirement are called universal [32]. When the basis functions are
shifted replicates of a single template $h : \mathbb{R}^d \to \mathbb{R}$, then the condition is equivalent to
h being strictly positive definite, which means that its Fourier transform is real-valued,
symmetric, and (strictly) positive [13]. Similar guarantees of universal approximation
exist for (shallow) neural networks under mild conditions on the activation functions
[5, 16, 25, 30, 39]. The main difference with the RKHS framework, however, is that the
universality results for neural nets usually make the assumption that the input domain
is a compact subset of $\mathbb{R}^d$.
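The positive-definiteness condition on the template h can be probed numerically. The sketch below is a one-dimensional illustration under assumptions that are not in the text: the sample points are the integers 0 to 5, the Gaussian serves as a template with a positive Fourier transform, and the box function (whose Fourier transform is a sign-changing sinc) serves as a counterexample. The helper name `gram` is hypothetical.

```python
import numpy as np

def gram(h, points):
    """Matrix with entries h(x_m - x_n) for shifted replicates of h."""
    diff = points[:, None] - points[None, :]
    return h(diff)

def gaussian(t):
    # Fourier transform is a Gaussian, hence strictly positive.
    return np.exp(-0.5 * t**2)

def box(t):
    # Fourier transform is a sinc, which takes negative values.
    return (np.abs(t) <= 1.0).astype(float)

points = np.arange(6, dtype=float)

# Smallest eigenvalue of each (symmetric) Gram matrix.
min_eig_gauss = np.linalg.eigvalsh(gram(gaussian, points)).min()
min_eig_box = np.linalg.eigvalsh(gram(box, points)).min()

print("Gaussian:", min_eig_gauss)
print("Box:     ", min_eig_box)
```

On these points the Gaussian Gram matrix has only positive eigenvalues, whereas the box template yields a negative one, so the box fails to be positive definite. A single set of points can disprove positive definiteness but cannot prove it; the check is only a sanity test.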
The purpose of this paper is to unify and extend these various approaches by
introducing a universal regularization functional. The latter has two components: an (...truncated)