QUICK REVIEW

[Paper Review] Selection via Proxy: Efficient Data Selection for Deep Learning

Cody Coleman, Christopher Yeh | arXiv (Cornell University) | Jun 26, 2019

Machine Learning and Algorithms | References: 54 | Cited by: 76

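ONE-SENTENCE SUMMARY

Selection via proxy (SVP) uses small, fast proxy models to perform data selection for active learning and core-set selection in deep learning, achieving large speedups across multiple datasets with minimal loss in final accuracy.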

ABSTRACT

Data selection methods, such as active learning and core-set selection, are useful tools for machine learning on large datasets. However, they can be prohibitively expensive to apply in deep learning because they depend on feature representations that need to be learned. In this work, we show that we can greatly improve the computational efficiency by using a small proxy model to perform data selection (e.g., selecting data points to label for active learning). By removing hidden layers from the target model, using smaller architectures, and training for fewer epochs, we create proxies that are an order of magnitude faster to train. Although these small proxy models have higher error rates, we find that they empirically provide useful signals for data selection. We evaluate this "selection via proxy" (SVP) approach on several data selection tasks across five datasets: CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity, and Amazon Review Full. For active learning, applying SVP can give an order of magnitude improvement in data selection runtime (i.e., the time it takes to repeatedly train and select points) without significantly increasing the final error (often within 0.1%). For core-set selection on CIFAR10, proxies that are over 10x faster to train than their larger, more accurate targets can remove up to 50% of the data without harming the final accuracy of the target, leading to a 1.6x end-to-end training time improvement.

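MOTIVATION AND OBJECTIVES

Motivate data selection methods for deep learning (active learning and core-set selection) and address their high computational cost.
Propose SVP, which replaces the expensive target-model representation with a cheaper proxy representation during selection.
Show, across multiple datasets, that proxy-based selection preserves final accuracy while substantially reducing data selection time.
Provide empirical evidence that proxy and target models rank examples similarly, which justifies using a proxy during selection.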

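PROPOSED METHOD

Create cheap proxy models by reducing depth or width, or by training for fewer epochs, so that they approximate the target model's decision boundary.
When computing selection metrics (uncertainty, distance-based diversity, forgetting events), use the proxy's representation in place of the target model's.
Apply SVP to two data selection paradigms: (i) active learning with least confidence and greedy k-centers; (ii) core-set selection with forgetting events, entropy, and greedy k-centers.
Compare target models trained on the selected data against target models trained on the full dataset to assess the effect on final test error.
Measure the correlation (Spearman/Pearson) between proxy and target rankings to explain why proxies are effective.
Use the CIFAR-10/100, ImageNet, Amazon Review Polarity, and Amazon Review Full datasets, with models such as ResNet variants and text classifiers serving as proxies and targets.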
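
The selection metrics listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`least_confidence`, `entropy`, `forgetting_events`, `greedy_k_centers`, `top_k`) are hypothetical, and the inputs are assumed to be class probabilities, per-epoch correctness records, and feature embeddings that have already been produced by a trained proxy model. Under SVP, only the source of those inputs changes (proxy instead of target); the metrics themselves stay the same.

```python
import numpy as np

def least_confidence(probs):
    """Uncertainty score: 1 - max class probability (higher = more uncertain)."""
    return 1.0 - probs.max(axis=1)

def entropy(probs, eps=1e-12):
    """Predictive entropy of each row of class probabilities."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def forgetting_events(correct_by_epoch):
    """Count correct -> incorrect transitions per example.

    correct_by_epoch: bool array of shape (epochs, n_examples).
    """
    c = correct_by_epoch.astype(int)
    return ((c[:-1] == 1) & (c[1:] == 0)).sum(axis=0)

def greedy_k_centers(features, labeled_idx, budget):
    """Greedily pick `budget` points, each the farthest from the current centers."""
    # Distance from every point to its nearest already-labeled center.
    centers = features[labeled_idx]
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    min_dist = dists.min(axis=1)
    selected = []
    for _ in range(budget):
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        # The newly chosen point becomes a center; update nearest-center distances.
        new_dist = np.linalg.norm(features - features[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected

def top_k(scores, k):
    """Indices of the k highest-scoring examples."""
    return np.argsort(-scores, kind="stable")[:k]

# Toy inputs standing in for proxy outputs (made-up values, not from the paper).
proxy_probs = np.array([[0.90, 0.10], [0.55, 0.45], [0.70, 0.30]])
to_label = top_k(least_confidence(proxy_probs), k=1)
```

In an active learning loop, `to_label` would be sent for labeling and the proxy retrained before the next round; for core-set selection, the same scores would instead decide which examples to keep for training the target.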

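EXPERIMENTAL RESULTS

Research Questions

RQ1: Compared with large target models, do smaller proxy models provide reliable rankings for selecting informative data points?
RQ2: What data selection speedups (in runtime) does SVP achieve on active learning and core-set tasks?
RQ3: Does proxy-based selection maintain final test accuracy similar to target-model-based selection across diverse datasets and modalities?
RQ4: How strongly do the ranking signals (uncertainty, forgetting events, entropy, k-centers) correlate between proxy and target models?
RQ5: Is SVP broadly applicable to architectures and tasks beyond image classification?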

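Key Findings

In active learning, SVP speeds up data selection runtime by up to 41.9x on Amazon Review Polarity and Full, and by up to 7x on CIFAR-10/100.
SVP incurs minimal loss in final accuracy; the final error is often within about 0.1% of that obtained with target-model selection.
Core-set selection with a proxy can remove up to 50% of the CIFAR-10 data without a significant drop in ResNet164 accuracy, giving roughly a 1.6x end-to-end training speedup.
Proxy models trained for fewer epochs or built on smaller architectures produce rankings for uncertainty, forgetting events, and k-centers that correlate highly with those of the large target model.
Across datasets and architectures, proxy rankings show high Spearman/Pearson correlation with those of the large models, supporting the broad applicability of SVP.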
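
To make the rank-correlation finding concrete, the sketch below computes a Spearman correlation and a top-k overlap between proxy and target scores. It is only an illustration under stated assumptions: the helper names (`ranks`, `spearman`, `top_k_overlap`) are hypothetical, the score values are made-up toy data rather than results from the paper, and the rank computation assumes no tied scores (tied scores, as can occur with forgetting-event counts, would need average ranks).

```python
import numpy as np

def ranks(scores):
    """Rank of each score (0 = smallest); assumes no tied scores."""
    order = np.argsort(scores)
    r = np.empty(len(scores), dtype=float)
    r[order] = np.arange(len(scores))
    return r

def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def top_k_overlap(a, b, k):
    """Fraction of the k highest-scoring examples shared by both score vectors."""
    top_a = set(np.argsort(-a)[:k].tolist())
    top_b = set(np.argsort(-b)[:k].tolist())
    return len(top_a & top_b) / k

# Toy uncertainty scores for five examples (illustrative values only).
proxy_scores = np.array([0.10, 0.40, 0.35, 0.80, 0.05])
target_scores = np.array([0.20, 0.30, 0.50, 0.90, 0.01])

rho = spearman(proxy_scores, target_scores)
overlap = top_k_overlap(proxy_scores, target_scores, k=2)
```

A high `rho` indicates that the proxy orders examples much as the target does, while `overlap` measures agreement on the examples that a selection budget of `k` would actually pick. The two can differ, since a ranking can correlate well overall yet disagree near the selection cutoff.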


This review was generated by AI and reviewed by a human editor.