Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion, and Blind Deconvolution

Foundations of Computational Mathematics, Aug 2019

Recent years have seen a flurry of activities in designing provably efficient nonconvex procedures for solving statistical estimation problems. Due to the highly nonconvex nature of the empirical loss, state-of-the-art procedures often require proper regularization (e.g., trimming, regularized cost, projection) in order to guarantee fast convergence. For vanilla procedures such as gradient descent, however, prior theory either recommends highly conservative learning rates to avoid overshooting, or completely lacks performance guarantees. This paper uncovers a striking phenomenon in nonconvex optimization: even in the absence of explicit regularization, gradient descent enforces proper regularization implicitly under various statistical models. In fact, gradient descent follows a trajectory staying within a basin that enjoys nice geometry, consisting of points incoherent with the sampling mechanism. This “implicit regularization” feature allows gradient descent to proceed in a far more aggressive fashion without overshooting, which in turn results in substantial computational savings. Focusing on three fundamental statistical estimation problems, i.e., phase retrieval, low-rank matrix completion, and blind deconvolution, we establish that gradient descent achieves near-optimal statistical and computational guarantees without explicit regularization. In particular, by marrying statistical modeling with generic optimization theory, we develop a general recipe for analyzing the trajectories of iterative algorithms via a leave-one-out perturbation argument. As a by-product, for noisy matrix completion, we demonstrate that gradient descent achieves near-optimal error control—measured entrywise and by the spectral norm—which might be of independent interest.

https://doi.org/10.1007/s10208-019-09429-9

Cong Ma · Kaizheng Wang · Yuejie Chi · Yuxin Chen

Received: 14 December 2017 / Revised: 8 May 2019 / Accepted: 18 June 2019
© The Author(s) 2019

Keywords Nonconvex optimization · Gradient descent · Leave-one-out analysis · Phase retrieval · Matrix completion · Blind deconvolution

Communicated by Emmanuel J. Candès.

Mathematics Subject Classification 90C26

1 Introduction

1.1 Nonlinear Systems and Empirical Loss Minimization

A wide spectrum of science and engineering applications calls for solutions to a nonlinear system of equations. Imagine we have collected a set of data points $y = \{y_j\}_{1 \le j \le m}$, generated by a nonlinear sensing system,

$$ y_j \approx \mathcal{A}_j(x^\natural), \quad 1 \le j \le m, $$

where $x^\natural$ is the unknown object of interest and the $\mathcal{A}_j$'s are certain nonlinear maps known a priori. Can we reconstruct the underlying object $x^\natural$ in a faithful yet efficient manner? Problems of this kind abound in information and statistical science, prominent examples including low-rank matrix recovery [19,64], robust principal component analysis [17,21], phase retrieval [20,59], and neural networks [103,132], to name just a few.

In principle, it is possible to attempt reconstruction by searching for a solution that minimizes the empirical loss, namely

$$ \text{minimize}_{x} \quad f(x) = \sum_{j=1}^{m} \left| y_j - \mathcal{A}_j(x) \right|^2. \tag{1} $$

Unfortunately, this empirical loss minimization problem is, in many cases, nonconvex, making it NP-hard in general. This issue of nonconvexity comes up in, for example, several representative problems that epitomize the structures of nonlinear systems encountered in practice.¹

• Phase retrieval/solving quadratic systems of equations Imagine we are asked to recover an unknown object $x^\natural \in \mathbb{R}^n$, but are only given the square modulus of certain linear measurements about the object, with all sign/phase information of the measurements missing.
This arises, for example, in X-ray crystallography [15], and in latent-variable models where the hidden variables are captured by the missing signs [33]. To fix ideas, assume we would like to solve for $x^\natural \in \mathbb{R}^n$ in the following quadratic system of $m$ equations

$$ y_j = \left( a_j^\top x^\natural \right)^2, \quad 1 \le j \le m, $$

where $\{a_j\}_{1 \le j \le m}$ are the known design vectors. One strategy is thus to solve the following problem

$$ \text{minimize}_{x \in \mathbb{R}^n} \quad f(x) = \frac{1}{4m} \sum_{j=1}^{m} \left[ y_j - \left( a_j^\top x \right)^2 \right]^2. \tag{2} $$

¹ Here, we choose different pre-constants in front of the empirical loss in order to be consistent with the literature of the respective problems. In addition, we only introduce the problem in the noiseless case for simplicity of presentation.

• Low-rank matrix completion In many scenarios such as collaborative filtering, we wish to make predictions about all entries of an (approximately) low-rank matrix $M^\natural \in \mathbb{R}^{n \times n}$ (e.g., a matrix consisting of users' ratings about many movies), yet only a highly incomplete subset of the entries are revealed to us [19]. For clarity of presentation, assume $M^\natural$ to be rank-$r$ ($r \ll n$) and positive semidefinite (PSD), i.e., $M^\natural = X^\natural X^{\natural\top}$ with $X^\natural \in \mathbb{R}^{n \times r}$, and suppose we have only seen the entries

$$ Y_{j,k} = M^\natural_{j,k} = \left( X^\natural X^{\natural\top} \right)_{j,k}, \quad (j,k) \in \Omega $$

within some index subset $\Omega$ of cardinality $m$. These entries can be viewed as nonlinear measurements about the low-rank factor $X^\natural$. The task of completing the true matrix $M^\natural$ can then be cast as solving

$$ \text{minimize}_{X \in \mathbb{R}^{n \times r}} \quad f(X) = \frac{n^2}{4m} \sum_{(j,k) \in \Omega} \left( Y_{j,k} - e_j^\top X X^\top e_k \right)^2, \tag{3} $$

where the $e_j$'s stand for the canonical basis vectors in $\mathbb{R}^n$.

• Blind deconvolution/solving bilinear systems of equations Imagine we are interested in estimating two signals of interest $h^\natural, x^\natural \in \mathbb{C}^K$, but only get to collect a few bilinear measurements about them. This problem arises from mathematical modeling of blind deconvolution [3,76], which frequently arises in astronomy, imaging, communications, etc. The goal is to recover two signals from their convolution.
Put more formally, suppose we have acquired $m$ bilinear measurements taking the following form

$$ y_j = b_j^{\mathsf{H}} h^\natural x^{\natural\mathsf{H}} a_j, \quad 1 \le j \le m, $$

where $a_j, b_j \in \mathbb{C}^K$ are distinct design vectors (e.g., Fourier and/or random design vectors) known a priori and $b_j^{\mathsf{H}}$ denotes the conjugate transpose of $b_j$. In order to reconstruct the underlying signals, one asks for solutions to the following problem

$$ \text{minimize}_{h, x \in \mathbb{C}^K} \quad f(h, x) = \sum_{j=1}^{m} \left| y_j - b_j^{\mathsf{H}} h x^{\mathsf{H}} a_j \right|^2. $$

1.2 Nonconvex Optimization via Regularized Gradient Descent

First-order methods have been a popular heuristic in practice for solving nonconvex problems including (1). For instance, a widely adopted procedure is gradient descent, which follows the update rule

$$ x^{t+1} = x^t - \eta_t \nabla f\left( x^t \right), \quad t \ge 0, \tag{4} $$

where $\eta_t$ is the learning rate (or step size) and $x^0$ is some proper initial guess. Given that it only performs a single gradient calculation (...truncated)
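
To make the update rule (4) concrete, the following is a minimal sketch of vanilla gradient descent applied to the phase retrieval loss (2), with no trimming, projection, or regularized cost. Everything beyond (2) and (4) is an assumption made for illustration and is not taken from the text above: the i.i.d. standard Gaussian design vectors, the power-iteration ("spectral") initializer, the constant step size 0.05 / ||x0||^2, the problem sizes, and all function names (`loss`, `grad`, `spectral_init`, `gradient_descent`, `dist_up_to_sign`). The gradient used is $\nabla f(x) = \frac{1}{m} \sum_{j} \left[ (a_j^\top x)^2 - y_j \right] (a_j^\top x)\, a_j$, obtained by differentiating (2).

```python
import math
import random


def loss(x, A, y):
    """Empirical loss (2): (1/4m) * sum_j (y_j - (a_j^T x)^2)^2."""
    m = len(y)
    total = 0.0
    for a, yj in zip(A, y):
        inner = sum(ai * xi for ai, xi in zip(a, x))
        total += (yj - inner * inner) ** 2
    return total / (4 * m)


def grad(x, A, y):
    """Gradient of (2): (1/m) * sum_j ((a_j^T x)^2 - y_j) * (a_j^T x) * a_j."""
    m, n = len(y), len(x)
    g = [0.0] * n
    for a, yj in zip(A, y):
        inner = sum(ai * xi for ai, xi in zip(a, x))
        coeff = (inner * inner - yj) * inner / m
        for i in range(n):
            g[i] += coeff * a[i]
    return g


def spectral_init(A, y, iters=200, seed=1):
    """Assumed initializer: power iteration on (1/m) * sum_j y_j a_j a_j^T."""
    m, n = len(y), len(A[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [0.0] * n
        for a, yj in zip(A, y):
            inner = sum(ai * vi for ai, vi in zip(a, v))
            for i in range(n):
                w[i] += yj * inner * a[i] / m
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    # For standard Gaussian a_j, E[y_j] = ||x||^2, so this rescales the direction.
    scale = math.sqrt(sum(y) / m)
    return [scale * vi for vi in v]


def gradient_descent(x0, A, y, eta, steps):
    """Update rule (4) with a constant step size eta."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x, A, y)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


def dist_up_to_sign(x, z):
    """Euclidean distance modulo the global sign ambiguity x vs. -x."""
    d_plus = math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))
    d_minus = math.sqrt(sum((xi + zi) ** 2 for xi, zi in zip(x, z)))
    return min(d_plus, d_minus)


# Synthetic noiseless instance (sizes chosen only for illustration).
rng = random.Random(0)
n, m = 5, 200
x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
y = [sum(ai * xi for ai, xi in zip(a, x_true)) ** 2 for a in A]

x0 = spectral_init(A, y)
eta = 0.05 / sum(v * v for v in x0)  # assumed constant step size
x_hat = gradient_descent(x0, A, y, eta, steps=500)

rel_err = dist_up_to_sign(x_hat, x_true) / math.sqrt(sum(v * v for v in x_true))
print("relative error up to sign:", rel_err)
```

Because the measurements $y_j$ are unchanged when $x$ is replaced by $-x$, the error is measured up to a global sign. The step size is scaled by $1/\|x^0\|^2$ so that the iteration behaves the same way under a rescaling of the unknown signal.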


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1007%2Fs10208-019-09429-9.pdf
Article home page: https://link.springer.com/article/10.1007/s10208-019-09429-9

Cong Ma, Kaizheng Wang, Yuejie Chi, Yuxin Chen. Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion, and Blind Deconvolution, Foundations of Computational Mathematics, 2019, pp. 1-182, DOI: 10.1007/s10208-019-09429-9