Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion, and Blind Deconvolution
Foundations of Computational Mathematics
https://doi.org/10.1007/s10208-019-09429-9
Cong Ma1 · Kaizheng Wang1 · Yuejie Chi2 · Yuxin Chen3
Received: 14 December 2017 / Revised: 8 May 2019 / Accepted: 18 June 2019
© The Author(s) 2019
Abstract
Recent years have seen a flurry of activities in designing provably efficient nonconvex
procedures for solving statistical estimation problems. Due to the highly nonconvex
nature of the empirical loss, state-of-the-art procedures often require proper regularization (e.g., trimming, regularized cost, projection) in order to guarantee fast
convergence. For vanilla procedures such as gradient descent, however, prior theory
either recommends highly conservative learning rates to avoid overshooting, or completely lacks performance guarantees. This paper uncovers a striking phenomenon
in nonconvex optimization: even in the absence of explicit regularization, gradient
descent enforces proper regularization implicitly under various statistical models. In
fact, gradient descent follows a trajectory staying within a basin that enjoys nice geometry, consisting of points incoherent with the sampling mechanism. This “implicit
regularization” feature allows gradient descent to proceed in a far more aggressive fashion without overshooting, which in turn results in substantial computational
savings. Focusing on three fundamental statistical estimation problems, i.e., phase
retrieval, low-rank matrix completion, and blind deconvolution, we establish that gradient descent achieves near-optimal statistical and computational guarantees without
explicit regularization. In particular, by marrying statistical modeling with generic
optimization theory, we develop a general recipe for analyzing the trajectories of iterative algorithms via a leave-one-out perturbation argument. As a by-product, for noisy
matrix completion, we demonstrate that gradient descent achieves near-optimal error
control—measured entrywise and by the spectral norm—which might be of independent interest.
Keywords Nonconvex optimization · Gradient descent · Leave-one-out analysis ·
Phase retrieval · Matrix completion · Blind deconvolution
Communicated by Emmanuel J. Candès.
Extended author information available on the last page of the article
Mathematics Subject Classification 90C26
1 Introduction
1.1 Nonlinear Systems and Empirical Loss Minimization
A wide spectrum of science and engineering applications calls for solutions to a
nonlinear system of equations. Imagine we have collected a set of data points $y = \{y_j\}_{1\le j\le m}$, generated by a nonlinear sensing system,
$$y_j \approx \mathcal{A}_j(x^\natural), \qquad 1 \le j \le m,$$
where $x^\natural$ is the unknown object of interest and the $\mathcal{A}_j$'s are certain nonlinear maps
known a priori. Can we reconstruct the underlying object $x^\natural$ in a faithful yet efficient
manner? Problems of this kind abound in information and statistical science, prominent
examples including low-rank matrix recovery [19,64], robust principal component
analysis [17,21], phase retrieval [20,59], neural networks [103,132], to name just a
few.
In principle, it is possible to attempt reconstruction by searching for a solution that
minimizes the empirical loss, namely
$$\text{minimize}_{x} \quad f(x) = \sum_{j=1}^{m} \left| y_j - \mathcal{A}_j(x) \right|^2. \tag{1}$$
Unfortunately, this empirical loss minimization problem is, in many cases, nonconvex,
making it NP-hard in general. This issue of nonconvexity comes up in, for example,
several representative problems that epitomize the structures of nonlinear systems
encountered in practice.1
• Phase retrieval/solving quadratic systems of equations Imagine we are asked to
recover an unknown object $x^\natural \in \mathbb{R}^n$, but are only given the square modulus
of certain linear measurements about the object, with all sign/phase information
of the measurements missing. This arises, for example, in X-ray crystallography
[15], and in latent-variable models where the hidden variables are captured by the
missing signs [33]. To fix ideas, assume we would like to solve for $x^\natural \in \mathbb{R}^n$ in the
following quadratic system of $m$ equations
$$y_j = \left(a_j^\top x^\natural\right)^2, \qquad 1 \le j \le m,$$
1 Here, we choose different pre-constants in front of the empirical loss in order to be consistent with the
literature of the respective problems. In addition, we only introduce the problem in the noiseless case for
simplicity of presentation.
where $\{a_j\}_{1\le j\le m}$ are the known design vectors. One strategy is thus to solve the
following problem
$$\text{minimize}_{x\in\mathbb{R}^n} \quad f(x) = \frac{1}{4m}\sum_{j=1}^{m}\left[ y_j - \left(a_j^\top x\right)^2 \right]^2. \tag{2}$$
• Low-rank matrix completion In many scenarios such as collaborative filtering, we
wish to make predictions about all entries of an (approximately) low-rank matrix
$M^\natural \in \mathbb{R}^{n\times n}$ (e.g., a matrix consisting of users’ ratings about many movies), yet
only a highly incomplete subset of the entries are revealed to us [19]. For clarity
of presentation, assume $M^\natural$ to be rank-$r$ ($r \ll n$) and positive semidefinite (PSD),
i.e., $M^\natural = X^\natural X^{\natural\top}$ with $X^\natural \in \mathbb{R}^{n\times r}$, and suppose we have only seen the entries
$$Y_{j,k} = M^\natural_{j,k} = \left(X^\natural X^{\natural\top}\right)_{j,k}, \qquad (j,k) \in \Omega$$
within some index subset $\Omega$ of cardinality $m$. These entries can be viewed as
nonlinear measurements about the low-rank factor $X^\natural$. The task of completing the
true matrix $M^\natural$ can then be cast as solving
$$\text{minimize}_{X\in\mathbb{R}^{n\times r}} \quad f(X) = \frac{n^2}{4m}\sum_{(j,k)\in\Omega}\left( Y_{j,k} - e_j^\top X X^\top e_k \right)^2, \tag{3}$$
where the $e_j$’s stand for the canonical basis vectors in $\mathbb{R}^n$.
• Blind deconvolution/solving bilinear systems of equations Imagine we are interested in estimating two signals of interest $h^\natural, x^\natural \in \mathbb{C}^K$, but only get to collect a
few bilinear measurements about them. This problem stems from the mathematical
modeling of blind deconvolution [3,76], a task frequently encountered in astronomy,
imaging, communications, etc. The goal is to recover two signals from their convolution. Put more formally, suppose we have acquired $m$ bilinear measurements
taking the following form
$$y_j = b_j^{\mathsf{H}} h^\natural x^{\natural\mathsf{H}} a_j, \qquad 1 \le j \le m,$$
where $a_j, b_j \in \mathbb{C}^K$ are distinct design vectors (e.g., Fourier and/or random design
vectors) known a priori and $b_j^{\mathsf{H}}$ denotes the conjugate transpose of $b_j$. In order to
reconstruct the underlying signals, one asks for solutions to the following problem
$$\text{minimize}_{h,x\in\mathbb{C}^K} \quad f(h, x) = \sum_{j=1}^{m}\left| y_j - b_j^{\mathsf{H}} h x^{\mathsf{H}} a_j \right|^2.$$
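As an illustrative sanity check on the three formulations above, the following minimal NumPy sketch evaluates each empirical loss on small synthetic instances. The Gaussian design vectors, the problem sizes, the independent Bernoulli sampling mask, and all function and variable names are assumptions made here for illustration; they are not taken from the paper. In the noiseless setting each loss should vanish at the ground truth and be strictly positive at a perturbed point.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; an arbitrary illustrative choice


def phase_retrieval_loss(x, A, y):
    """Loss (2): 1/(4m) * sum_j [y_j - (a_j^T x)^2]^2, with rows of A equal to a_j^T."""
    m = A.shape[0]
    return np.sum((y - (A @ x) ** 2) ** 2) / (4 * m)


def matrix_completion_loss(X, Y, mask):
    """Loss (3): n^2/(4m) * sum over observed (j,k) of [Y_jk - (X X^T)_jk]^2."""
    n = Y.shape[0]
    m = mask.sum()  # number of observed entries
    resid = (Y - X @ X.T)[mask]
    return n ** 2 / (4 * m) * np.sum(resid ** 2)


def blind_deconv_loss(h, x, B, A, y):
    """Bilinear loss: sum_j |y_j - b_j^H h x^H a_j|^2, rows of B and A are b_j and a_j."""
    bh = B.conj() @ h  # entries b_j^H h
    xa = A @ x.conj()  # entries x^H a_j
    return np.sum(np.abs(y - bh * xa) ** 2)


def complex_normal(*shape):
    """Hypothetical helper: i.i.d. complex Gaussian entries."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)


# Phase retrieval: n = 5 unknowns, m = 40 quadratic measurements.
x_true = rng.standard_normal(5)
A_pr = rng.standard_normal((40, 5))
y_pr = (A_pr @ x_true) ** 2
pr_at_truth = phase_retrieval_loss(x_true, A_pr, y_pr)
pr_perturbed = phase_retrieval_loss(x_true + 0.1, A_pr, y_pr)

# Matrix completion: 6 x 6 PSD matrix of rank 2, each entry observed w.p. 0.5.
X_true = rng.standard_normal((6, 2))
M_true = X_true @ X_true.T
mask = rng.random((6, 6)) < 0.5
Y = np.where(mask, M_true, 0.0)  # unobserved entries are never used by the loss
mc_at_truth = matrix_completion_loss(X_true, Y, mask)
mc_perturbed = matrix_completion_loss(X_true + 0.1, Y, mask)

# Blind deconvolution: K = 4, m = 30 bilinear measurements.
h_true, z_true = complex_normal(4), complex_normal(4)
B_bd, A_bd = complex_normal(30, 4), complex_normal(30, 4)
y_bd = (B_bd.conj() @ h_true) * (A_bd @ z_true.conj())
bd_at_truth = blind_deconv_loss(h_true, z_true, B_bd, A_bd, y_bd)
bd_perturbed = blind_deconv_loss(h_true + 0.1, z_true, B_bd, A_bd, y_bd)

print("phase retrieval   :", pr_at_truth, pr_perturbed)
print("matrix completion :", mc_at_truth, mc_perturbed)
print("blind deconvolution:", bd_at_truth, bd_perturbed)
```

The loss functions take the design as stacked rows so that all $m$ measurements are evaluated with a single matrix-vector product; this is a convenience of the sketch rather than anything prescribed by the formulations.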
1.2 Nonconvex Optimization via Regularized Gradient Descent
First-order methods have been a popular heuristic in practice for solving nonconvex
problems including (1). For instance, a widely adopted procedure is gradient descent,
which follows the update rule
$$x^{t+1} = x^{t} - \eta_t \nabla f\left(x^{t}\right), \qquad t \ge 0, \tag{4}$$
where $\eta_t$ is the learning rate (or step size) and $x^0$ is some proper initial guess. Given
that it only performs a single gradient calculation (...truncated)