A control-theoretic perspective on optimal high-order optimization

Mathematical Programming, Series A
https://doi.org/10.1007/s10107-021-01721-3

FULL LENGTH PAPER

Tianyi Lin¹ · Michael I. Jordan¹,²

Received: 21 December 2019 / Accepted: 4 October 2021
© The Author(s) 2021

Abstract

We provide a control-theoretic perspective on optimal tensor algorithms for minimizing a convex function in a finite-dimensional Euclidean space. Given a function $\Phi : \mathbb{R}^d \to \mathbb{R}$ that is convex and twice continuously differentiable, we study a closed-loop control system that is governed by the operators $\nabla \Phi$ and $\nabla^2 \Phi$ together with a feedback control law $\lambda(\cdot)$ satisfying the algebraic equation $(\lambda(t))^p \|\nabla \Phi(x(t))\|^{p-1} = \theta$ for some $\theta \in (0, 1)$. Our first contribution is to prove the existence and uniqueness of a local solution to this system via the Banach fixed-point theorem. We present a simple yet nontrivial Lyapunov function that allows us to establish the existence and uniqueness of a global solution under certain regularity conditions and analyze the convergence properties of trajectories. The rate of convergence is $O(1/t^{(3p+1)/2})$ in terms of objective function gap and $O(1/t^{3p})$ in terms of squared gradient norm. Our second contribution is to provide two algorithmic frameworks obtained from discretization of our continuous-time system, one of which generalizes the large-step A-HPE framework of Monteiro and Svaiter (SIAM J Optim 23(2):1092–1125, 2013) and the other of which leads to a new optimal p-th order tensor algorithm. While our discrete-time analysis can be seen as a simplification and generalization of Monteiro and Svaiter (2013), it is largely motivated by the aforementioned continuous-time analysis, demonstrating the fundamental role that the feedback control plays in optimal acceleration and the clear advantage that the continuous-time perspective brings to algorithmic design.
A highlight of our analysis is that we show that all of the p-th order optimal tensor algorithms that we discuss minimize the squared gradient norm at a rate of $O(k^{-3p})$, which complements the recent analysis in Gasnikov et al. (in: COLT, PMLR, pp 1374–1391, 2019), Jiang et al. (in: COLT, PMLR, pp 1799–1801, 2019) and Bubeck et al. (in: COLT, PMLR, pp 492–507, 2019).

¹ Department of Electrical Engineering and Computer Science, UC Berkeley, Berkeley, USA
² Department of Statistics, UC Berkeley, Berkeley, USA

Keywords Convex optimization · Optimal acceleration · Closed-loop control system · Feedback control · High-order tensor algorithm · Iteration complexity

Mathematics Subject Classification 37N40 · 90C25 · 90C60 · 49M37 · 68Q25

1 Introduction

The interplay between continuous-time and discrete-time perspectives on dynamical systems has made a major impact on optimization theory. Classical examples include (1) the interpretation of steepest descent, heavy ball and proximal algorithms as the explicit and implicit discretization of gradient-like dissipative systems [4,5,10,24,25,98]; and (2) the explicit discretization of Newton-like and Levenberg–Marquardt regularized systems [1,6,7,12,26–28,32–34,79], which give standard and regularized Newton algorithms. One particularly salient way that these connections have spurred research is via the use of Lyapunov functions to transfer asymptotic behavior and rates of convergence between continuous time and discrete time.

Recent years have witnessed a flurry of new research focusing on continuous-time perspectives on Nesterov’s accelerated gradient algorithm (NAG) [95] and related methods [38,67,90,108].
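The first classical example above, steepest descent and proximal algorithms as explicit and implicit discretizations of a gradient-like system, can be made concrete with a minimal sketch. It assumes the simplest such system, the gradient flow $\dot{x}(t) = -\nabla \Phi(x(t))$, and an illustrative diagonal quadratic $\Phi(x) = \frac{1}{2} \sum_i a_i x_i^2$; the curvatures `a`, the step sizes `h` and all function names are hypothetical choices made for this illustration and are not taken from the paper.

```python
# Gradient flow x'(t) = -grad Phi(x(t)) on an illustrative quadratic
# Phi(x) = 0.5 * sum(a_i * x_i**2). The curvatures `a` and step sizes `h`
# are assumptions chosen for this sketch only.


def phi(x, a):
    """Objective value of the diagonal quadratic."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad_phi(x, a):
    """Gradient of the diagonal quadratic."""
    return [ai * xi for ai, xi in zip(a, x)]


def explicit_euler_step(x, a, h):
    """x_next = x - h * grad Phi(x): one steepest-descent step."""
    g = grad_phi(x, a)
    return [xi - h * gi for xi, gi in zip(x, g)]


def implicit_euler_step(x, a, h):
    """x_next = x - h * grad Phi(x_next): one proximal-point step.

    For this diagonal quadratic the implicit equation can be solved in
    closed form, coordinate by coordinate: x_next_i = x_i / (1 + h * a_i).
    """
    return [xi / (1.0 + h * ai) for ai, xi in zip(a, x)]


def run(step, x0, a, h, n):
    """Apply `step` n times and record the objective value after each step."""
    x = list(x0)
    values = [phi(x, a)]
    for _ in range(n):
        x = step(x, a, h)
        values.append(phi(x, a))
    return x, values


a = [1.0, 10.0]   # largest curvature is 10
x0 = [1.0, 1.0]

# Small step: both discretizations decrease the objective.
x_exp, v_exp = run(explicit_euler_step, x0, a, 0.05, 50)
x_imp, v_imp = run(implicit_euler_step, x0, a, 0.05, 50)

# Step larger than 2 / max(a): the explicit scheme diverges,
# while the implicit scheme still decreases the objective.
x_div, v_div = run(explicit_euler_step, x0, a, 0.25, 50)
x_big, v_big = run(implicit_euler_step, x0, a, 0.25, 50)

print("explicit, h=0.05:", v_exp[-1])
print("implicit, h=0.05:", v_imp[-1])
print("explicit, h=0.25:", v_div[-1])
print("implicit, h=0.25:", v_big[-1])
```

In this sketch the two schemes differ only in where the gradient is evaluated, which is the sense in which steepest descent and the proximal point method discretize the same continuous-time system.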
These perspectives arise from derivations that obtain differential equations as limits of discrete dynamics [29,30,56,74,86,101,102,106,109], including quasi-gradient formulations and Kurdyka–Łojasiewicz theory [14,39] (see [36,37,52,53,69] for a geometrical perspective on the topic), inertial gradient systems with constant or asymptotically vanishing damping [15,20,21,106] and their extension to maximally monotone operators [16,17,45], Hessian-driven damping [6,13,18,28,31,46,102], time scaling [13,19,21,22], dry friction damping [2,3], closed-loop damping [13,14], control-theoretic design [58,68,77] and Lagrangian and Hamiltonian frameworks [40,55,59,60,78,87,96,110]. Examples of hitherto unknown results that have arisen from this line of research include the fact that NAG achieves a fast rate of $o(k^{-2})$ in terms of objective function gap [20,29,83] and $O(k^{-3})$ in terms of squared gradient norm [102].

The introduction of the Hessian-driven damping into continuous-time dynamics has been a particular milestone in optimization and mechanics. The precursor of this perspective can be found in the variational characterization of the Levenberg–Marquardt method and Newton’s method [7], a development that inspired work on continuous-time Newton-like approaches for convex minimization [7,32] and monotone inclusions [1,12,26,27,33,34,79]. Building on these works, [6] distinguished Hessian-driven damping from classical continuous Newton formulations and showed its importance in optimization and mechanics. Subsequently, [31] demonstrated the connection between Hessian-driven damping and the forward-backward algorithms in Nesterov acceleration (e.g., FISTA), and combined Hessian-driven damping with asymptotically vanishing damping [106]. The resulting dynamics takes the following form:

$$\ddot{x}(t) + \frac{\alpha}{t} \dot{x}(t) + \beta \nabla^2 \Phi(x(t)) \dot{x}(t) + \nabla \Phi(x(t)) = 0, \tag{1}$$
where it is worth mentioning that the presence of the Hessian does not entail numerical difficulties since it arises in the form $\nabla^2 \Phi(x(t)) \dot{x}(t)$, which is the time derivative of the function $t \mapsto \nabla \Phi(x(t))$. Further work in this vein appeared in [102], where Nesterov acceleration was interpreted via multiscale limits that yield high-resolution differential equations:

$$\ddot{x}(t) + \frac{3}{t} \dot{x}(t) + \sqrt{s}\, \nabla^2 \Phi(x(t)) \dot{x}(t) + \left(1 + \frac{3\sqrt{s}}{2t}\right) \nabla \Phi(x(t)) = 0. \tag{2}$$

These limits were used in particular to distinguish between Polyak’s heavy-ball method and NAG, which are not distinguished by naive limiting arguments that yield the same differential equation for both.

Although the coefficients are different in Eqs. (1) and (2), both contain Hessian-driven damping, which corresponds to a correction term obtained via discretization, and which provides fast convergence to zero of the gradients and reduces the oscillatory aspects. Using this viewpoint, several subtle analyses have been recently provided in work independent of ours [13,14]. In particular, these works develop a convergence theory for a general inertial system with asymptotically vanishing damping and Hessian-driven damping. Under certain conditions, the fast convergence is guaranteed in terms of both objective function gap and squared gradient norm. Beyond the aforementioned line of work, however, most of the focus in using continuous-time perspectives to shed light on acceleration has been (...truncated)