QUICK REVIEW

[Paper Review] Coresets for Data-efficient Training of Machine Learning Models

Baharan Mirzasoleiman, Jeff Bilmes|arXiv (Cornell University)|Jun 5, 2019

Stochastic Gradient Optimization Techniques|Cited by 37

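One-Sentence Summary

CRAIG selects a weighted subset of the training data (a coreset) that closely approximates the full gradient, so that incremental gradient methods converge at the same rate as on the full data while running substantially faster in practice.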

ABSTRACT

Incremental gradient (IG) methods, such as stochastic gradient descent and its variants are commonly used for large scale optimization in machine learning. Despite the sustained effort to make IG methods more data-efficient, it remains an open question how to select a training data subset that can theoretically and practically perform on par with the full dataset. Here we develop CRAIG, a method to select a weighted subset (or coreset) of training data that closely estimates the full gradient by maximizing a submodular function. We prove that applying IG to this subset is guaranteed to converge to the (near)optimal solution with the same convergence rate as that of IG for convex optimization. As a result, CRAIG achieves a speedup that is inversely proportional to the size of the subset. To our knowledge, this is the first rigorous method for data-efficient training of general machine learning models. Our extensive set of experiments show that CRAIG, while achieving practically the same solution, speeds up various IG methods by up to 6x for logistic regression and 3x for training deep neural networks.

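Motivation and Goals

Advance data-efficient training to reduce the computational cost and energy consumption of large-scale machine learning.
Develop a principled subset selection method that approximates the full gradient with a small weighted coreset.
Provide theoretical convergence guarantees showing that IG on the subset converges at the same rate as IG on the full data.
Demonstrate practical speedups and applicability to both convex and non-convex models.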

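Proposed Method

Define an objective L(S) that upper-bounds the gradient estimation error of a subset S relative to the full dataset V.
Recast the gradient-approximation objective as maximizing a monotone submodular facility location function F, and solve it with the greedy algorithm.
Set each subset weight gamma_j to the number of component functions (data points) whose closest element of S in gradient space is element j.
Prove that any IG method applied to S converges in the same number of epochs as on the full data, up to an error term that depends on epsilon.
Provide practical guidance for applying CRAIG to deep networks, including an approximation of the gradient bound that does not require full backpropagation.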
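
The selection and weighting steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `greedy_coreset` is hypothetical, the rows of `features` stand in for whatever per-example gradient proxy is used, and distances are converted to similarities by subtracting them from the largest pairwise distance so that the greedy step maximizes a facility-location coverage score.

```python
import numpy as np

def greedy_coreset(features, k):
    """Greedy facility-location selection (illustrative sketch).

    features: (n, d) array of per-example gradient proxies (an assumption).
    k: number of elements to select.
    Returns (selected indices, integer weights gamma that sum to n).
    """
    n = features.shape[0]
    # Pairwise Euclidean distances d_ij between gradient proxies.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Non-negative similarities: high coverage means a small summed
    # nearest-neighbor distance L(S).
    sim = dist.max() - dist

    selected = []
    best_sim = np.zeros(n)  # best similarity of each point to the current S
    for _ in range(k):
        # Marginal gain of each candidate j: total improvement in coverage.
        gains = np.maximum(sim - best_sim[:, None], 0.0).sum(axis=0)
        if selected:
            gains[selected] = -np.inf  # never pick the same element twice
        j = int(np.argmax(gains))
        selected.append(j)
        best_sim = np.maximum(best_sim, sim[:, j])

    selected = np.array(selected)
    # gamma_j = number of points whose nearest selected element is j.
    nearest = np.argmin(dist[:, selected], axis=1)
    gamma = np.bincount(nearest, minlength=k)
    return selected, gamma

# Usage: two well-separated clusters of sizes 3 and 2.
features = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                     [10.0, 10.0], [10.1, 10.0]])
selected, gamma = greedy_coreset(features, k=2)
```

With this toy input, one representative is chosen from each cluster and the weights equal the cluster sizes. The dense n-by-n distance matrix keeps the sketch short but would not scale to large datasets without a lazier or stochastic greedy variant.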

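Experimental Results

Research Questions

RQ1: Can a small weighted subset of the data approximate the full gradient closely enough to preserve the convergence behavior of IG?
RQ2: In convex problems, how does training on a CRAIG-selected subset affect the convergence rate and the final solution?
RQ3: Do CRAIG subsets deliver substantial speedups for SGD, SAGA, SVRG, and deep network training without sacrificing accuracy?
RQ4: How can CRAIG be extended to deep networks, where the gradient bounds are harder to compute?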

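Key Findings

With CRAIG, IG on the subset converges to practically the same solution as IG on the full data, with a speedup proportional to |V|/|S|.
For strongly convex problems, IG on the CRAIG subset converges up to an O(epsilon) error term and matches the full-data rate up to constant factors.
Experiments show speedups of up to 6x on convex problems (logistic regression) and up to 3x on non-convex deep networks, while reaching similar loss and accuracy.
CRAIG subsets (in some cases as little as 10% of the data) closely approximate the full gradient and outperform random subsets.
CRAIG is compatible with SGD, SAGA, and SVRG, and delivers practical performance gains on large datasets such as Covtype and Ijcnn1.
For neural networks, CRAIG reduces training time while maintaining or improving generalization, for example in experiments with a two-layer network on MNIST.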
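
To make the role of the weights concrete, the sketch below runs an incremental gradient pass over a weighted subset for unregularized logistic regression. It is an illustration under assumptions rather than the paper's code: the helper names are hypothetical, labels are assumed to lie in {-1, +1}, and each weight gamma_j is assumed to enter as a multiplier on the gradient of its representative example.

```python
import numpy as np

def logistic_grad(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) for one example, y in {-1, +1}."""
    margin = y * np.dot(w, x)
    return -y * x / (1.0 + np.exp(margin))

def weighted_gradient(w, X, y, subset, gamma):
    """Weighted sum of subset gradients, used as a stand-in for the full gradient sum."""
    return sum(g * logistic_grad(w, X[j], y[j]) for j, g in zip(subset, gamma))

def weighted_ig_epoch(w, X, y, subset, gamma, lr):
    """One incremental-gradient pass over the weighted subset.

    Each selected example j represents gamma_j examples, so its step
    is scaled by that weight (an assumption of this sketch).
    """
    w = w.copy()
    for j, g in zip(subset, gamma):
        w -= lr * g * logistic_grad(w, X[j], y[j])
    return w

# Usage: rows 0 and 1 are duplicates, so row 0 with weight 2 covers both.
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
w0 = np.array([0.3, -0.2])
full = sum(logistic_grad(w0, X[i], y[i]) for i in range(len(y)))
approx = weighted_gradient(w0, X, y, subset=[0, 2], gamma=[2, 1])
w1 = weighted_ig_epoch(np.zeros(2), X, y, subset=[0], gamma=[2], lr=0.1)
```

In this toy case the weighted subset gradient matches the full gradient sum exactly because the dropped example is a duplicate; on real data the match is only approximate, which is where the epsilon error term in the findings comes from. One pass touches |S| examples instead of |V|, which is the source of the |V|/|S| speedup.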


This review was generated by AI and reviewed by a human editor.