An Accelerated First-Order Method for Non-convex Optimization on Manifolds
Foundations of Computational Mathematics
https://doi.org/10.1007/s10208-022-09573-9
Christopher Criscitiello1 · Nicolas Boumal1
Received: 19 August 2020 / Revised: 3 February 2022 / Accepted: 1 March 2022
© The Author(s) 2022
Abstract
We describe the first gradient methods on Riemannian manifolds to achieve accelerated rates in the non-convex case. Under Lipschitz assumptions on the Riemannian
gradient and Hessian of the cost function, these methods find approximate first-order
critical points faster than regular gradient descent. A randomized version also finds
approximate second-order critical points. Both the algorithms and their analyses build
extensively on existing work in the Euclidean case. The basic operation consists in running the Euclidean accelerated gradient descent method (appropriately safe-guarded
against non-convexity) in the current tangent space, then moving back to the manifold and repeating. This requires lifting the cost function from the manifold to the
tangent space, which can be done for example through the Riemannian exponential
map. For this approach to succeed, the lifted cost function (called the pullback) must
retain certain Lipschitz properties. As a contribution of independent interest, we prove
precise claims to that effect, with explicit constants. Those claims are affected by the
Riemannian curvature of the manifold, which in turn affects the worst-case complexity
bounds for our optimization algorithms.
Keywords Optimization on manifolds · Accelerated gradient descent ·
Non-convex optimization · First-order method · Riemannian manifold · Jacobi field ·
Curvature
Mathematics Subject Classification 65K05 · 65J05 · 90C26 · 90C48 · 90C60 · 58C05
Communicated by James Renegar.
Nicolas Boumal
Christopher Criscitiello
1 Ecole Polytechnique Fédérale de Lausanne (EPFL), Institute of Mathematics, EPFL FSB SMA, Station 8, 1015 Lausanne, Switzerland
1 Introduction
We consider optimization problems of the form
$$\min_{x \in \mathcal{M}} f(x) \tag{P}$$
where f is lower-bounded and twice continuously differentiable on a Riemannian
manifold M. For the special case where M is a Euclidean space, problem (P) amounts
to smooth, unconstrained optimization. The more general case is important for applications notably in scientific computing, statistics, imaging, learning, communications
and robotics: see for example [1, 27].
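To make problem (P) concrete, the following is a minimal sketch of one instance, together with the lifted cost (pullback) obtained through the Riemannian exponential map. The choice of manifold (the unit sphere in R^3), the cost f(x) = x^T A x with a diagonal A, and all function names are illustrative assumptions, not taken from the paper.

```python
import math

# Illustrative instance of (P): M is the unit sphere in R^3 and
# f(x) = x^T A x with a diagonal matrix A (both choices are assumptions).
A_DIAG = [3.0, 2.0, 1.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def f(x):
    """Cost function restricted to the sphere."""
    return sum(a * xi * xi for a, xi in zip(A_DIAG, x))

def exp_map(x, s):
    """Exponential map on the sphere: follow the great circle from x along the tangent vector s."""
    t = norm(s)
    if t == 0.0:
        return list(x)
    return [math.cos(t) * xi + math.sin(t) * si / t for xi, si in zip(x, s)]

def pullback(x):
    """Return the lifted cost s -> f(Exp_x(s)), a function on the tangent space at x."""
    return lambda s: f(exp_map(x, s))

x = [1.0, 0.0, 0.0]
s = [0.0, math.pi / 2, 0.0]  # tangent at x because <x, s> = 0
f_hat = pullback(x)
print(f_hat([0.0, 0.0, 0.0]), f_hat(s))
```

The pullback is an ordinary function on a linear space, which is what allows a Euclidean method to be run in the tangent space before moving back to the manifold.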
For a general non-convex objective f , computing a global minimizer of (P) is
hard. Instead, our goal is to compute approximate first- and second-order critical
points of (P). A number of non-convex problems of interest exhibit the property that
second-order critical points are optimal [7, 11, 14, 24, 30, 36, 49]. Several of these
are optimization problems on nonlinear manifolds. Therefore, theoretical guarantees
for approximately finding second-order critical points can translate to guarantees for
approximately solving these problems.
It is therefore natural to ask for fast algorithms which find approximate second-order critical points on manifolds, within a tolerance $\epsilon$ (see below). Existing algorithms
include RTR [13], ARC [2] and perturbed RGD [20, 44]. Under some regularity conditions, ARC uses Hessian-vector products to achieve a rate of $O(\epsilon^{-7/4})$. In contrast,
under the same regularity conditions, perturbed RGD uses only function value and
gradient queries, but achieves a poorer rate of $O(\epsilon^{-2})$. Does there exist an algorithm
which finds approximate second-order critical points with a rate of $O(\epsilon^{-7/4})$ using
only function value and gradient queries? The answer was known to be yes in Euclidean
space. Can it also be done on Riemannian manifolds, hence extending applicability
to applications treated in the aforementioned references? We resolve that question
positively with the algorithm PTAGD below.
From a different perspective, the recent success of momentum-based first-order
methods in machine learning [42] has encouraged interest in momentum-based first-order algorithms for non-convex optimization which are provably faster than gradient
descent [15, 28]. We show such provable guarantees can be extended to optimization under a manifold constraint. From this perspective, our paper is part of a body
of work theoretically explaining the success of momentum methods in non-convex
optimization.
There has been significant difficulty in accelerating geodesically convex optimization on Riemannian manifolds. See “Related literature” below for more details on best
known bounds [3] as well as results proving that acceleration in certain settings is
impossible on manifolds [26]. Given this difficulty, it is not at all clear a priori that it is
possible to accelerate non-convex optimization on Riemannian manifolds. Our paper
shows that it is in fact possible.
We design two new algorithms and establish worst-case complexity bounds under
Lipschitz assumptions on the gradient and Hessian of f . Beyond a theoretical contribution, we hope that this work will provide an impetus to look for more practical fast
first-order algorithms on manifolds.
More precisely, if the gradient of f is L-Lipschitz continuous (in the Riemannian
sense defined below), it is known that Riemannian gradient descent can find an approximate first-order critical point¹ in at most $O(\Delta_f L / \epsilon^2)$ queries, where $\Delta_f$
upper-bounds the gap between initial and optimal cost value [8, 13, 47]. Moreover,
this rate is optimal in the special case where M is a Euclidean space [16], but it can
be improved under the additional assumption that the Hessian of f is ρ-Lipschitz
continuous.
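As a point of reference for the baseline discussed above, here is a minimal sketch of Riemannian gradient descent with constant step size 1/L, stopped once the Riemannian gradient norm falls below a tolerance. The manifold (unit sphere in R^3), the toy cost f(x) = x^T A x, the value used for L and all names are illustrative assumptions; this is not the accelerated method of the paper.

```python
import math

A_DIAG = [3.0, 2.0, 1.0]  # toy cost f(x) = x^T A x on the unit sphere (assumption)
L_CONST = 4.0             # assumed Lipschitz constant: 2 * (largest - smallest entry of A_DIAG)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def f(x):
    return sum(a * xi * xi for a, xi in zip(A_DIAG, x))

def riemannian_grad(x):
    """Project the Euclidean gradient 2 A x onto the tangent space at x."""
    egrad = [2.0 * a * xi for a, xi in zip(A_DIAG, x)]
    c = dot(x, egrad)
    return [g - c * xi for g, xi in zip(egrad, x)]

def exp_map(x, s):
    """Exponential map on the sphere: follow the great circle from x along s."""
    t = norm(s)
    if t == 0.0:
        return list(x)
    return [math.cos(t) * xi + math.sin(t) * si / t for xi, si in zip(x, s)]

def rgd(x, eps, max_iters=100000):
    """Riemannian gradient descent with step 1/L; stops when the gradient norm is at most eps."""
    for k in range(max_iters):
        g = riemannian_grad(x)
        if norm(g) <= eps:
            return x, k
        x = exp_map(x, [-gi / L_CONST for gi in g])
    return x, max_iters

x0 = [1.0 / math.sqrt(3.0)] * 3
x_final, iters = rgd(x0, eps=1e-6)
print(iters, f(x_final), norm(riemannian_grad(x_final)))
```

Each iteration uses one gradient query, so the iteration count returned by `rgd` is the quantity that the worst-case query bound controls.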
Recently in Euclidean space, Carmon et al. [15] have proposed a deterministic
algorithm for this setting (L-Lipschitz gradient, ρ-Lipschitz Hessian) which requires
at most $\tilde{O}(\Delta_f L^{1/2} \rho^{1/4} / \epsilon^{7/4})$ queries (up to logarithmic factors), and is independent
of dimension. This is a speed up of Riemannian gradient descent by a factor of $\tilde{\Theta}\big(\sqrt{L / \sqrt{\rho\epsilon}}\big)$.
For the Euclidean case, it has been shown under these assumptions that first-order
methods require at least $\Omega(\Delta_f L^{3/7} \rho^{2/7} / \epsilon^{12/7})$ queries [17, Thm. 2]. This leaves a
gap of merely $\tilde{O}(1/\epsilon^{1/28})$ in the $\epsilon$-dependency.
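The speed-up factor and the size of the remaining gap both follow from exponent arithmetic on the bounds quoted above. As a quick check, with $\Delta_f$ denoting the bound on the initial optimality gap and $\epsilon$ the target tolerance, and ignoring logarithmic factors:

```latex
\[
\frac{\Delta_f L / \epsilon^{2}}{\Delta_f L^{1/2} \rho^{1/4} / \epsilon^{7/4}}
  = \frac{L^{1/2}}{\rho^{1/4} \epsilon^{1/4}}
  = \sqrt{\frac{L}{\sqrt{\rho \epsilon}}},
\qquad
\frac{\epsilon^{-7/4}}{\epsilon^{-12/7}}
  = \epsilon^{-49/28 + 48/28}
  = \epsilon^{-1/28}.
\]
```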
Soon after, Jin et al. [28] showed how a related algorithm with randomization can
find $(\epsilon, \sqrt{\rho\epsilon})$-approximate second-order critical points² with the same complexity, up
to polylogarithmic factors in the dimension of the search space and in (the reciprocal
of) the probability of failure.
Both the algorithm of Carmon et al. [15] and that of Jin et al. [28] fundamentally
rely on Nesterov’s accelerated gradient descent method (AGD) [40], with safe-guards
against non-convexity. To achieve improved rates, AGD builds heavily on a notion
of momentum which accumulates across several iterations. This makes it delicate
to extend AGD to nonlinear manifolds, as it would seem that we need to transfer
m (...truncated)