An Accelerated First-Order Method for Non-convex Optimization on Manifolds

Foundations of Computational Mathematics
https://doi.org/10.1007/s10208-022-09573-9

Christopher Criscitiello¹ · Nicolas Boumal¹

Received: 19 August 2020 / Revised: 3 February 2022 / Accepted: 1 March 2022
© The Author(s) 2022

Abstract

We describe the first gradient methods on Riemannian manifolds to achieve accelerated rates in the non-convex case. Under Lipschitz assumptions on the Riemannian gradient and Hessian of the cost function, these methods find approximate first-order critical points faster than regular gradient descent. A randomized version also finds approximate second-order critical points. Both the algorithms and their analyses build extensively on existing work in the Euclidean case. The basic operation consists in running the Euclidean accelerated gradient descent method (appropriately safe-guarded against non-convexity) in the current tangent space, then moving back to the manifold and repeating. This requires lifting the cost function from the manifold to the tangent space, which can be done for example through the Riemannian exponential map. For this approach to succeed, the lifted cost function (called the pullback) must retain certain Lipschitz properties. As a contribution of independent interest, we prove precise claims to that effect, with explicit constants. Those claims are affected by the Riemannian curvature of the manifold, which in turn affects the worst-case complexity bounds for our optimization algorithms.

Keywords: Optimization on manifolds · Accelerated gradient descent · Non-convex optimization · First-order method · Riemannian manifold · Jacobi field · Curvature

Mathematics Subject Classification: 65K05 · 65J05 · 90C26 · 90C48 · 90C60 · 58C05

Communicated by James Renegar.
¹ Ecole Polytechnique Fédérale de Lausanne (EPFL), Institute of Mathematics, EPFL FSB SMA, Station 8, 1015 Lausanne, Switzerland

1 Introduction

We consider optimization problems of the form

$$\min_{x \in \mathcal{M}} f(x) \tag{P}$$

where $f$ is lower-bounded and twice continuously differentiable on a Riemannian manifold $\mathcal{M}$. For the special case where $\mathcal{M}$ is a Euclidean space, problem (P) amounts to smooth, unconstrained optimization. The more general case is important for applications notably in scientific computing, statistics, imaging, learning, communications and robotics: see for example [1, 27].

For a general non-convex objective $f$, computing a global minimizer of (P) is hard. Instead, our goal is to compute approximate first- and second-order critical points of (P). A number of non-convex problems of interest exhibit the property that second-order critical points are optimal [7, 11, 14, 24, 30, 36, 49]. Several of these are optimization problems on nonlinear manifolds. Therefore, theoretical guarantees for approximately finding second-order critical points can translate to guarantees for approximately solving these problems.

It is therefore natural to ask for fast algorithms which find approximate second-order critical points on manifolds, within a tolerance $\epsilon$ (see below). Existing algorithms include RTR [13], ARC [2] and perturbed RGD [20, 44]. Under some regularity conditions, ARC uses Hessian-vector products to achieve a rate of $O(\epsilon^{-7/4})$. In contrast, under the same regularity conditions, perturbed RGD uses only function value and gradient queries, but achieves a poorer rate of $O(\epsilon^{-2})$. Does there exist an algorithm which finds approximate second-order critical points with a rate of $O(\epsilon^{-7/4})$ using only function value and gradient queries? The answer was known to be yes in Euclidean space.
Can it also be done on Riemannian manifolds, hence extending applicability to applications treated in the aforementioned references? We resolve that question positively with the algorithm PTAGD below.

From a different perspective, the recent success of momentum-based first-order methods in machine learning [42] has encouraged interest in momentum-based first-order algorithms for non-convex optimization which are provably faster than gradient descent [15, 28]. We show such provable guarantees can be extended to optimization under a manifold constraint. From this perspective, our paper is part of a body of work theoretically explaining the success of momentum methods in non-convex optimization.

There has been significant difficulty in accelerating geodesically convex optimization on Riemannian manifolds. See “Related literature” below for more details on best known bounds [3] as well as results proving that acceleration in certain settings is impossible on manifolds [26]. Given this difficulty, it is not at all clear a priori that it is possible to accelerate non-convex optimization on Riemannian manifolds. Our paper shows that it is in fact possible.

We design two new algorithms and establish worst-case complexity bounds under Lipschitz assumptions on the gradient and Hessian of $f$. Beyond a theoretical contribution, we hope that this work will provide an impetus to look for more practical fast first-order algorithms on manifolds.

More precisely, if the gradient of $f$ is $L$-Lipschitz continuous (in the Riemannian sense defined below), it is known that Riemannian gradient descent can find an $\epsilon$-approximate first-order critical point¹ in at most $O(\Delta_f L / \epsilon^2)$ queries, where $\Delta_f$ upper-bounds the gap between initial and optimal cost value [8, 13, 47].
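As a rough illustration of the gradient-descent baseline mentioned above (and not of the accelerated algorithms this paper proposes), the following is a minimal sketch of Riemannian gradient descent with step size $1/L$ on the unit sphere. Everything specific here is an assumption made for the example: the toy cost $f(x) = x^\top A x$, the diagonal matrix `A`, the helper names (`riemannian_grad`, `exp_map`, `rgd`), and the constant `L = 6.0`, chosen as $2(\lambda_{\max} - \lambda_{\min})$ for this particular `A`.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def cost(A, x):
    # Toy cost (an assumption for this sketch): f(x) = x^T A x on the unit sphere.
    return dot(x, matvec(A, x))

def riemannian_grad(A, x):
    # Euclidean gradient 2 A x, projected onto the tangent space at x.
    Ax = matvec(A, x)
    fx = dot(x, Ax)
    return [2.0 * (axi - fx * xi) for axi, xi in zip(Ax, x)]

def exp_map(x, v):
    # Exponential map of the unit sphere: follow the great circle from x along v.
    nv = math.sqrt(dot(v, v))
    if nv == 0.0:
        return list(x)
    c, s = math.cos(nv), math.sin(nv)
    y = [c * xi + s * vi / nv for xi, vi in zip(x, v)]
    ny = math.sqrt(dot(y, y))
    return [yi / ny for yi in y]  # renormalize to limit round-off drift

def rgd(A, x0, L, eps, max_iters=10000):
    # Riemannian gradient descent with constant step 1/L; stops once the
    # Riemannian gradient norm is at most eps. Returns (point, iterations used).
    x = list(x0)
    for k in range(max_iters):
        g = riemannian_grad(A, x)
        if math.sqrt(dot(g, g)) <= eps:
            return x, k
        x = exp_map(x, [-gi / L for gi in g])
    return x, max_iters

# Hypothetical test problem: eigenvalues 1, 2, 4, so L = 2 * (4 - 1) = 6.
A = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]]
s = 1.0 / math.sqrt(3.0)
x_star, iters = rgd(A, [s, s, s], L=6.0, eps=1e-6)
print(iters, cost(A, x_star))
```

The returned iteration count is one gradient query per step, which is the quantity that the worst-case bound counts. On a benign problem like this one, the count is typically far below the worst case.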
Moreover, this rate is optimal in the special case where $\mathcal{M}$ is a Euclidean space [16], but it can be improved under the additional assumption that the Hessian of $f$ is $\rho$-Lipschitz continuous. Recently in Euclidean space, Carmon et al. [15] have proposed a deterministic algorithm for this setting ($L$-Lipschitz gradient, $\rho$-Lipschitz Hessian) which requires at most $\tilde{O}(\Delta_f L^{1/2} \rho^{1/4} / \epsilon^{7/4})$ queries (up to logarithmic factors), and is independent of dimension. This is a speed-up of Riemannian gradient descent by a factor of $\tilde{\Theta}\big(\sqrt{L / \sqrt{\rho \epsilon}}\big)$. For the Euclidean case, it has been shown under these assumptions that first-order methods require at least $\Omega(\Delta_f L^{3/7} \rho^{2/7} / \epsilon^{12/7})$ queries [17, Thm. 2]. This leaves a gap of merely $\tilde{O}(1/\epsilon^{1/28})$ in the $\epsilon$-dependency.

Soon after, Jin et al. [28] showed how a related algorithm with randomization can find $(\epsilon, \sqrt{\rho \epsilon})$-approximate second-order critical points² with the same complexity, up to polylogarithmic factors in the dimension of the search space and in the (reciprocal of the) probability of failure.

Both the algorithm of Carmon et al. [15] and that of Jin et al. [28] fundamentally rely on Nesterov’s accelerated gradient descent method (AGD) [40], with safe-guards against non-convexity. To achieve improved rates, AGD builds heavily on a notion of momentum which accumulates across several iterations. This makes it delicate to extend AGD to nonlinear manifolds, as it would seem that we need to transfer m (...truncated)