[Paper Review] The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing
Introduces ScaledGD(λ), a preconditioned gradient descent method for overparameterized low-rank matrix sensing that converges rapidly from small random initialization and is robust to ill-conditioning and noise. It achieves near-minimax-optimal error, and its iteration complexity depends only polylogarithmically on the condition number and the problem dimension.
We propose ScaledGD(λ), a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparameterized factor representations, ScaledGD(λ) starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, ScaledGD(λ) is remarkably robust to ill-conditioning compared to vanilla gradient descent (GD) even with overparameterization. Specifically, we show that, under the Gaussian design, ScaledGD(λ) converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla GD, which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.
Motivation & Objective
- Address low-rank matrix sensing when the true rank is unknown and matrices may be ill-conditioned.
- Develop a preconditioned nonconvex optimization method that remains robust under overparameterization.
- Provide global convergence guarantees from random initialization.
- Characterize performance under measurement noise and approximate low-rankness.
Proposed method
- Introduce ScaledGD(λ), preconditioned gradient descent with a fixed damping parameter λ: X_{t+1} = X_t - η ∇f(X_t) (X_t^T X_t + λ I)^{-1}, where f(X) = (1/4) ||A(X X^T) - y||^2.
- Show equivariance of iterates to rotations in the factor X, ensuring M_t = X_t X_t^T is invariant to parameterization.
- Assume rank-(r*+1) RIP for the sensing operator A and small random initialization X_0 = αG with α chosen per Assumption 2.
- Provide global convergence guarantees from random initialization in the overparameterized regime r ≥ r*, with iteration complexity scaling poly-logarithmically with κ (condition number) and n.
- Extend analysis to exact parameterization (r = r*) and to noisy measurements, establishing minimax-optimal error up to κ factors.
- Discuss extension to approximately low-rank matrices under Gaussian design.
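The update rule and the equivariance property listed above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the problem sizes, and the values of η, λ, and α are assumptions chosen for a small demo, and the sensing matrices are taken to be symmetric Gaussian matrices scaled by 1/sqrt(m).

```python
import numpy as np

def gaussian_sensing(n, m, rng):
    """Draw m symmetric Gaussian sensing matrices (assumed design for this demo),
    scaled so that E||A(M)||^2 = ||M||_F^2 for symmetric M."""
    G = rng.standard_normal((m, n, n))
    return (G + G.transpose(0, 2, 1)) / (2.0 * np.sqrt(m))

def grad_f(X, A, y):
    """Gradient of f(X) = (1/4)||A(XX^T) - y||^2 when every A_k is symmetric."""
    residual = np.einsum("kij,ij->k", A, X @ X.T) - y
    return np.einsum("k,kij->ij", residual, A) @ X

def scaled_gd_step(X, A, y, eta, lam):
    """One step X - eta * grad_f(X) (X^T X + lam I)^{-1}.
    The r x r preconditioner is applied with a linear solve, not an explicit inverse."""
    P = X.T @ X + lam * np.eye(X.shape[1])
    return X - eta * np.linalg.solve(P, grad_f(X, A, y).T).T

def run_demo(n=20, r_star=2, r=4, m=1500, kappa=5.0,
             eta=0.2, lam=0.05, alpha=1e-3, iters=300, seed=0):
    """Recover a rank-r_star PSD matrix with an overparameterized factor (r > r_star).
    All default values are illustrative assumptions, not the paper's settings."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n, r_star)))
    sigma = np.linspace(kappa, 1.0, r_star)      # singular values from kappa down to 1
    M_star = (U * sigma) @ U.T
    A = gaussian_sensing(n, m, rng)
    y = np.einsum("kij,ij->k", A, M_star)        # noiseless measurements
    X = alpha * rng.standard_normal((n, r))      # small random initialization X_0 = alpha * G
    for _ in range(iters):
        X = scaled_gd_step(X, A, y, eta, lam)
    return np.linalg.norm(X @ X.T - M_star) / np.linalg.norm(M_star)

if __name__ == "__main__":
    print(f"relative error: {run_demo():.2e}")
```

For any orthogonal r×r matrix Q, `scaled_gd_step(X @ Q, ...)` equals `scaled_gd_step(X, ...) @ Q`, because both the gradient and the preconditioner rotate with X. The product X_t X_t^T therefore does not depend on which rotation of the factor is tracked.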
Experimental results
Research questions
- RQ1: Can ScaledGD(λ) achieve global convergence from small random initialization when the rank is overparameterized (r ≥ r*)?
- RQ2: How does preconditioning affect convergence rate and robustness to ill-conditioning compared to vanilla gradient descent?
- RQ3: What are the iteration and sample complexities under RIP and Gaussian design?
- RQ4: How does ScaledGD(λ) perform in the presence of measurement noise or approximate low-rankness?
- RQ5: Do the guarantees extend to the exact parameterization and to approximately low-rank settings?
Key findings
- ScaledGD(λ) converges to the true low-rank matrix at a constant linear rate after an initial phase whose length scales only logarithmically with κ and n, for a total iteration count of O(log κ · log(κn) + log(1/ε)).
- Under Gaussian design, the sample complexity depends on the true rank r* and not on the overparameterized rank r, provided m ≳ n r*^2 poly(κ).
- In the noisy setting, ScaledGD(λ) attains minimax-optimal error up to a κ factor, with the final error matching rates similar to the noiseless case when ε is tuned.
- Exact parameterization (r = r*) yields convergence to M* from random initialization with an additional logarithmic overhead compared to spectral initialization results.
- The method also extends to the approximately low-rank setting under Gaussian design, maintaining near-optimal recovery of M* or its best rank-r approximation M_r.
- The work demonstrates that preconditioning can accelerate convergence without sacrificing generalization in overparameterized learning.
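To make the contrast with vanilla GD concrete, the sketch below compares the two updates on a deliberately simplified problem: the matrix is fully observed (the sensing operator is replaced by the identity), which is not the sensing setting analyzed in the paper. The function names, the vanilla GD step size η/σ_max, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def rel_error_after(M_star, r, iters, precondition,
                    eta=0.2, lam=0.05, alpha=1e-3, seed=0):
    """Run `iters` steps on f(X) = (1/4)||XX^T - M_star||_F^2 from a small random
    start and return the relative Frobenius error of XX^T.
    Simplification: full observation instead of a sensing operator."""
    rng = np.random.default_rng(seed)
    n = M_star.shape[0]
    sigma_max = np.linalg.norm(M_star, 2)        # largest singular value
    X = alpha * rng.standard_normal((n, r))
    for _ in range(iters):
        grad = (X @ X.T - M_star) @ X
        if precondition:
            # Damped preconditioning: grad (X^T X + lam I)^{-1}
            P = X.T @ X + lam * np.eye(r)
            X = X - eta * np.linalg.solve(P, grad.T).T
        else:
            # Vanilla GD with an assumed step size of eta / sigma_max
            X = X - (eta / sigma_max) * grad
    return np.linalg.norm(X @ X.T - M_star) / np.linalg.norm(M_star)

def ill_conditioned_target(n=30, kappa=50.0, seed=1):
    """Rank-2 PSD matrix with singular values kappa and 1."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n, 2)))
    return (U * np.array([kappa, 1.0])) @ U.T

M_star = ill_conditioned_target()
gd_err = rel_error_after(M_star, r=4, iters=100, precondition=False)
scaled_err = rel_error_after(M_star, r=4, iters=100, precondition=True)
print(f"vanilla GD: {gd_err:.2e}   ScaledGD(lambda): {scaled_err:.2e}")
```

In this toy setting, vanilla GD grows the component along the smallest singular value by a factor of only about 1 + η/κ per step, so with the same iteration budget it has barely started to recover that direction, while the preconditioned update grows all directions at comparable speed.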
This review was created by AI and reviewed by human editors.