A control-theoretic perspective on optimal high-order optimization
Mathematical Programming
https://doi.org/10.1007/s10107-021-01721-3
FULL LENGTH PAPER
Series A
Tianyi Lin¹ · Michael I. Jordan¹,²
Received: 21 December 2019 / Accepted: 4 October 2021
© The Author(s) 2021
Abstract
We provide a control-theoretic perspective on optimal tensor algorithms for minimizing a convex function in a finite-dimensional Euclidean space. Given a function $\Phi : \mathbb{R}^d \to \mathbb{R}$ that is convex and twice continuously differentiable, we study a closed-loop
control system that is governed by the operators $\nabla\Phi$ and $\nabla^2\Phi$ together with a feedback control law $\lambda(\cdot)$ satisfying the algebraic equation $(\lambda(t))^p \|\nabla\Phi(x(t))\|^{p-1} = \theta$
for some θ ∈ (0, 1). Our first contribution is to prove the existence and uniqueness
of a local solution to this system via the Banach fixed-point theorem. We present a
simple yet nontrivial Lyapunov function that allows us to establish the existence and
uniqueness of a global solution under certain regularity conditions and analyze the
convergence properties of trajectories. The rate of convergence is $O(1/t^{(3p+1)/2})$ in
terms of objective function gap and $O(1/t^{3p})$ in terms of squared gradient norm.
Our second contribution is to provide two algorithmic frameworks obtained from discretization of our continuous-time system, one of which generalizes the large-step
A-HPE framework of Monteiro and Svaiter (SIAM J Optim 23(2):1092–1125, 2013)
and the other of which leads to a new optimal p-th order tensor algorithm. While our
discrete-time analysis can be seen as a simplification and generalization of Monteiro
and Svaiter (2013), it is largely motivated by the aforementioned continuous-time
analysis, demonstrating the fundamental role that the feedback control plays in optimal acceleration and the clear advantage that the continuous-time perspective brings
to algorithmic design. A highlight of our analysis is that we show that all of the p-th
order optimal tensor algorithms that we discuss minimize the squared gradient norm
at a rate of $O(k^{-3p})$, which complements the recent analysis in Gasnikov et al. (in:
COLT, PMLR, pp 1374–1391, 2019), Jiang et al. (in: COLT, PMLR, pp 1799–1801,
2019) and Bubeck et al. (in: COLT, PMLR, pp 492–507, 2019).
Tianyi Lin
Michael I. Jordan
¹ Department of Electrical Engineering and Computer Science, UC Berkeley, Berkeley, USA
² Department of Statistics, UC Berkeley, Berkeley, USA
Keywords Convex optimization · Optimal acceleration · Closed-loop control
system · Feedback control · High-order tensor algorithm · Iteration complexity
Mathematics Subject Classification 37N40 · 90C25 · 90C60 · 49M37 · 68Q25
1 Introduction
The interplay between continuous-time and discrete-time perspectives on dynamical
systems has made a major impact on optimization theory. Classical examples include
(1) the interpretation of steepest descent, heavy ball and proximal algorithms as the
explicit and implicit discretization of gradient-like dissipative systems [4,5,10,24,
25,98]; and (2) the explicit discretization of Newton-like and Levenberg–Marquardt
regularized systems [1,6,7,12,26–28,32–34,79], which give standard and regularized
Newton algorithms. One particularly salient way that these connections have spurred
research is via the use of Lyapunov functions to transfer asymptotic behavior and rates
of convergence between continuous time and discrete time.
Recent years have witnessed a flurry of new research focusing on continuous-time
perspectives on Nesterov’s accelerated gradient algorithm (NAG) [95] and related
methods [38,67,90,108]. These perspectives arise from derivations that obtain differential equations as limits of discrete dynamics [29,30,56,74,86,101,102,106,109],
including quasi-gradient formulations and Kurdyka–Łojasiewicz theory [14,39] (see
[36,37,52,53,69] for a geometrical perspective on the topic), inertial gradient systems
with constant or asymptotically vanishing damping [15,20,21,106] and their extension to
maximally monotone operators [16,17,45], Hessian-driven damping [6,13,18,28,31,
46,102], time scaling [13,19,21,22], dry friction damping [2,3], closed-loop damping
[13,14], control-theoretic design [58,68,77] and Lagrangian and Hamiltonian frameworks [40,55,59,60,78,87,96,110]. Examples of hitherto unknown results that have
arisen from this line of research include the fact that NAG achieves a fast rate of
$o(k^{-2})$ in terms of objective function gap [20,29,83] and $O(k^{-3})$ in terms of squared
gradient norm [102].
The introduction of the Hessian-driven damping into continuous-time dynamics
has been a particular milestone in optimization and mechanics. The precursor of
this perspective can be found in the variational characterization of the Levenberg–
Marquardt method and Newton’s method [7], a development that inspired work on
continuous-time Newton-like approaches for convex minimization [7,32] and monotone inclusions [1,12,26,27,33,34,79]. Building on these works, [6] distinguished
Hessian-driven damping from classical continuous Newton formulations and showed
its importance in optimization and mechanics. Subsequently, [31] demonstrated the
connection between Hessian-driven damping and the forward-backward algorithms
in Nesterov acceleration (e.g., FISTA), and combined Hessian-driven damping with
asymptotically vanishing damping [106]. The resulting dynamics takes the following
form:
$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta\nabla^2\Phi(x(t))\dot{x}(t) + \nabla\Phi(x(t)) = 0, \tag{1}$$
where it is worth mentioning that the presence of the Hessian does not entail numerical
difficulties since it arises in the form $\nabla^2\Phi(x(t))\dot{x}(t)$, which is the time derivative of the
function $t \mapsto \nabla\Phi(x(t))$. Further work in this vein appeared in [102], where Nesterov
acceleration was interpreted via multiscale limits that yield high-resolution differential
equations:
$$\ddot{x}(t) + \frac{3}{t}\dot{x}(t) + \sqrt{s}\,\nabla^2\Phi(x(t))\dot{x}(t) + \left(1 + \frac{3\sqrt{s}}{2t}\right)\nabla\Phi(x(t)) = 0. \tag{2}$$
These limits were used in particular to distinguish between Polyak’s heavy-ball method
and NAG, which are not distinguished by naive limiting arguments that yield the same
differential equation for both.
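To make the remark about the Hessian term concrete, the following is a minimal numerical sketch of Eq. (1), not a method taken from the works cited above. It introduces an auxiliary variable y = ẋ + β∇Φ(x), whose derivative is ẏ = ẍ + β∇²Φ(x)ẋ, so that Eq. (1) becomes a first-order system involving gradients only. The quadratic objective, the values α = 3 and β = 1, the initial time t0 = 1, the step size, and the classical fourth-order Runge-Kutta integrator are all illustrative assumptions, as are the function names.

```python
def phi(x):
    # Illustrative convex objective (an assumption): Phi(x) = 0.5*(x0^2 + 10*x1^2).
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad_phi(x):
    # Gradient of the illustrative objective above.
    return [x[0], 10.0 * x[1]]

def rhs(t, z, alpha, beta):
    # State z = (x, y) with y = x' + beta * grad Phi(x). Eq. (1) then reads
    #   x' = y - beta * grad Phi(x),
    #   y' = -(alpha / t) * x' - grad Phi(x),
    # so no Hessian is ever formed.
    n = len(z) // 2
    x, y = z[:n], z[n:]
    g = grad_phi(x)
    dx = [yi - beta * gi for yi, gi in zip(y, g)]
    dy = [-(alpha / t) * dxi - gi for dxi, gi in zip(dx, g)]
    return dx + dy

def axpy(z, a, k):
    # Return z + a * k componentwise.
    return [zi + a * ki for zi, ki in zip(z, k)]

def integrate(x0, alpha=3.0, beta=1.0, t0=1.0, t1=20.0, h=0.01):
    # Start at rest: x'(t0) = 0, hence y(t0) = beta * grad Phi(x0).
    z = list(x0) + [beta * gi for gi in grad_phi(x0)]
    t = t0
    for _ in range(int(round((t1 - t0) / h))):
        # One classical RK4 step of size h.
        k1 = rhs(t, z, alpha, beta)
        k2 = rhs(t + h / 2, axpy(z, h / 2, k1), alpha, beta)
        k3 = rhs(t + h / 2, axpy(z, h / 2, k2), alpha, beta)
        k4 = rhs(t + h, axpy(z, h, k3), alpha, beta)
        z = [zi + h / 6 * (a + 2 * b + 2 * c + d)
             for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]
        t += h
    return z[:len(x0)]

x0 = [1.0, 1.0]
xT = integrate(x0)
print("Phi(x0) =", phi(x0), " Phi(x(T)) =", phi(xT))
```

On this toy problem the objective value at the final time should be far below its initial value. The sketch covers Eq. (1) only. Eq. (2) has a different gradient coefficient, so the same substitution would need to be adapted before it could be reused.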
Although the coefficients are different in Eqs. (1) and (2), both contain Hessian-driven damping, which corresponds to a correction term obtained via discretization,
and which provides fast convergence to zero of the gradients and reduces the oscillatory
aspects. Using this viewpoint, several subtle analyses have been recently provided in
work independent of ours [13,14]. In particular, they develop a convergence theory
for a general inertial system with asymptotically vanishing damping and Hessian-driven
damping. Under certain conditions, fast convergence is guaranteed in terms of both
objective function gap and squared gradient norm. Beyond the aforementioned line of
work, however, most of the focus in using continuous-time perspectives to shed light
on acceleration has been (...truncated)