QUICK REVIEW

[Paper Review] The Landscape of Empirical Risk for Non-convex Losses

Mei Song, Yu Bai | arXiv (Cornell University) | Jul 22, 2016

Machine Learning and Algorithms | References: 44 | Cited by: 59

One-Sentence Summary

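This paper establishes uniform convergence of the gradient and Hessian of the empirical risk to their population counterparts for non-convex losses, which yields a one-to-one correspondence between the stationary points of the empirical and population risks. It shows that under a mild sample-size condition (n ≳ p log n), algorithms such as gradient descent converge to the global minimum in problems including non-convex binary classification, robust regression, and Gaussian mixture models.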
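
The objects named in this summary can be written out as follows. This is only a notational sketch: the symbols \ell, z_i, \Theta and the choice of norms are assumptions made here for illustration, and the paper's precise rates and regularity conditions are not reproduced.

```latex
% Empirical risk over n samples and its population counterpart
\hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; z_i),
\qquad
R(\theta) = \mathbb{E}\,\ell(\theta; Z)

% Uniform convergence of the gradient and of the Hessian over a bounded set \Theta
\sup_{\theta \in \Theta} \bigl\| \nabla \hat{R}_n(\theta) - \nabla R(\theta) \bigr\|_2 \to 0,
\qquad
\sup_{\theta \in \Theta} \bigl\| \nabla^2 \hat{R}_n(\theta) - \nabla^2 R(\theta) \bigr\|_{\mathrm{op}} \to 0
```

When both suprema are small, a stationary point of R with a non-degenerate Hessian has a nearby stationary point of \hat{R}_n of the same type, which is the mechanism behind the one-to-one correspondence described above.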

ABSTRACT

Most high-dimensional estimation and prediction methods propose to minimize a cost function (empirical risk) that is written as a sum of losses associated to each data point. In this paper we focus on the case of non-convex losses, which is practically important but still poorly understood. Classical empirical process theory implies uniform convergence of the empirical risk to the population risk. While uniform convergence implies consistency of the resulting M-estimator, it does not ensure that the latter can be computed efficiently. In order to capture the complexity of computing M-estimators, we propose to study the landscape of the empirical risk, namely its stationary points and their properties. We establish uniform convergence of the gradient and Hessian of the empirical risk to their population counterparts, as soon as the number of samples becomes larger than the number of unknown parameters (modulo logarithmic factors). Consequently, good properties of the population risk can be carried to the empirical risk, and we can establish one-to-one correspondence of their stationary points. We demonstrate that in several problems such as non-convex binary classification, robust regression, and Gaussian mixture model, this result implies a complete characterization of the landscape of the empirical risk, and of the convergence properties of descent algorithms. We extend our analysis to the very high-dimensional setting in which the number of parameters exceeds the number of samples, and provide a characterization of the empirical risk landscape under a nearly information-theoretically minimal condition. Namely, if the number of samples exceeds the sparsity of the unknown parameters vector (modulo logarithmic factors), then a suitable uniform convergence result takes place. We apply this result to non-convex binary classification and robust regression in very high-dimension.

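Motivation and Objectives

Understand the computational complexity of M-estimators in high-dimensional non-convex settings where the classical convexity assumption fails.
Characterize the landscape of the empirical risk for non-convex losses, in particular its stationary points and their stability.
Establish conditions under which descent algorithms still converge to the global minimum despite non-convexity.
Extend these results, under a sparsity assumption, to the high-dimensional regime p ≫ n.
Provide a theoretical foundation for the empirical success of non-convex optimization in problems such as robust regression and mixture models.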

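Proposed Method

Propose a framework that studies the empirical risk landscape through uniform convergence of its gradient and Hessian to their population counterparts.
Use empirical process theory to show that if n ≳ p log n, the empirical risk inherits the geometric properties of the population risk.
Establish, under mild regularity conditions, a one-to-one correspondence between the stationary points of the empirical risk and those of the population risk.
Apply the framework to three representative problems: non-convex binary classification, robust regression with a non-convex ρ-function, and the Gaussian mixture model.
Extend the analysis to the high-dimensional regime by assuming sparsity, showing that uniform convergence of the gradient and Hessian still holds when n ≳ s log n, where s is the sparsity of the true parameter.
Use the derived landscape properties to prove that the trust-region method converges to the global minimum.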
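
The uniform-convergence idea behind this framework can be illustrated numerically. The sketch below is not taken from the paper: it assumes a binary classification model with Gaussian features, labels drawn with probability sigmoid(<theta0, x>), and the squared loss (y - sigmoid(<theta, x>))^2, and it uses a large independent sample as a stand-in for the population gradient. The parameter theta0, the sample sizes, and the probe points are all hypothetical choices.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_data(n, theta0, rng):
    """Draw n pairs (x, y): x ~ N(0, I), P(y = 1 | x) = sigmoid(<theta0, x>)."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in theta0]
        prob = sigmoid(sum(t * xi for t, xi in zip(theta0, x)))
        y = 1.0 if rng.random() < prob else 0.0
        data.append((x, y))
    return data


def risk_gradient(theta, data):
    """Gradient of the empirical risk (1/n) * sum_i (y_i - sigmoid(<theta, x_i>))^2."""
    grad = [0.0] * len(theta)
    for x, y in data:
        s = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        coeff = 2.0 * (s - y) * s * (1.0 - s)  # chain rule through the sigmoid
        for j, xj in enumerate(x):
            grad[j] += coeff * xj
    return [g / len(data) for g in grad]


def max_gradient_gap(data, reference, points):
    """Largest Euclidean gap between the two gradients over the probe points."""
    gap = 0.0
    for theta in points:
        g_small = risk_gradient(theta, data)
        g_ref = risk_gradient(theta, reference)
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(g_small, g_ref)))
        gap = max(gap, diff)
    return gap


rng = random.Random(0)
theta0 = [1.0, -0.5, 0.5]  # hypothetical true parameter, p = 3
points = [[rng.uniform(-2.0, 2.0) for _ in theta0] for _ in range(5)]
reference = sample_data(40000, theta0, rng)  # proxy for the population risk

gaps = {}
for n in (50, 5000):
    gaps[n] = max_gradient_gap(sample_data(n, theta0, rng), reference, points)
    print(f"n = {n:5d}  max gradient gap over probe points = {gaps[n]:.4f}")
```

The gap is only measured at a handful of probe points, so it is a rough proxy for the supremum over the whole parameter set. It should nonetheless shrink as n grows relative to p.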

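Experimental Results

Research Questions

RQ1: Under what conditions does the empirical risk landscape reflect the population risk landscape in non-convex M-estimation?
RQ2: Can descent algorithms such as gradient descent or trust-region methods converge to the global minimum in non-convex problems?
RQ3: What relationship between the sample size n and the number of parameters p (or the sparsity s) ensures that the empirical risk inherits the favorable geometry of the population risk?
RQ4: What role does uniform convergence of the gradient and Hessian play in establishing convergence guarantees for non-convex optimization?
RQ5: In the high-dimensional regime p ≫ n, is global convergence of non-convex M-estimators still achievable under a sparsity assumption?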

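Key Findings

When n ≳ p log n, the gradient and Hessian of the empirical risk converge uniformly to their population counterparts, which ensures a one-to-one correspondence between stationary points.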
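For non-convex binary classification with the squared loss, the empirical risk has a unique local minimizer, which is also the global minimizer, and gradient descent converges to it.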
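For robust regression with a non-convex ρ-function, the empirical risk landscape inherits the absence of spurious local minima under the same sample-size condition.
For the Gaussian mixture model, the empirical risk has three stationary points: two local minima near the true component means and a saddle point at the origin, and the trust-region method converges to a global minimum.
In the high-dimensional regime, when p ≫ n and the true parameter is s-sparse, uniform convergence of the gradient and Hessian still holds if n ≳ s log n, which supports global convergence of descent algorithms.
When the initial point lies in a neighborhood of the origin and n ≳ d log d, the trust-region method converges to a global minimum of the Gaussian mixture model.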
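
The convergence behaviour reported for binary classification can be probed with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's experiment: it assumes Gaussian features, labels drawn with probability sigmoid(<theta0, x>), and the squared loss (y - sigmoid(<theta, x>))^2. The parameter theta0, the step size, the iteration count, and the initialization range are hypothetical values chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_data(n, theta0, rng):
    """Draw n pairs (x, y): x ~ N(0, I), P(y = 1 | x) = sigmoid(<theta0, x>)."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in theta0]
        prob = sigmoid(sum(t * xi for t, xi in zip(theta0, x)))
        y = 1.0 if rng.random() < prob else 0.0
        data.append((x, y))
    return data


def risk_gradient(theta, data):
    """Gradient of the empirical risk (1/n) * sum_i (y_i - sigmoid(<theta, x_i>))^2."""
    grad = [0.0] * len(theta)
    for x, y in data:
        s = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        coeff = 2.0 * (s - y) * s * (1.0 - s)  # chain rule through the sigmoid
        for j, xj in enumerate(x):
            grad[j] += coeff * xj
    return [g / len(data) for g in grad]


def gradient_descent(theta_init, data, step=2.0, iters=400):
    """Plain fixed-step gradient descent on the empirical risk."""
    theta = list(theta_init)
    for _ in range(iters):
        grad = risk_gradient(theta, data)
        theta = [t - step * g for t, g in zip(theta, grad)]
    return theta


def distance(a, b):
    """Euclidean distance between two parameter vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


rng = random.Random(1)
theta0 = [1.0, -0.5, 0.5]  # hypothetical true parameter, p = 3
data = sample_data(1000, theta0, rng)

# Several random starting points inside a bounded box around the origin.
inits = [[rng.uniform(-1.5, 1.5) for _ in theta0] for _ in range(3)]
finals = [gradient_descent(init, data) for init in inits]

for init, final in zip(inits, finals):
    rounded = [round(v, 3) for v in final]
    print(f"start distance to theta0 = {distance(init, theta0):.3f}  "
          f"end point = {rounded}  end distance = {distance(final, theta0):.3f}")
```

Comparing the end points across runs shows whether the different initializations reach the same stationary point, and the end distance indicates how far the empirical minimizer sits from theta0 at this sample size. The code uses fixed-step gradient descent throughout; the trust-region method mentioned for the Gaussian mixture model is not implemented here.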

This review was generated by AI and reviewed by a human editor.