QUICK REVIEW

[Paper Review] Less is More: An Exploration of Data Redundancy with Active Dataset Subsampling

Kashyap Chitta, José M. Alvarez|arXiv (Cornell University)|May 29, 2019

Machine Learning and Algorithms | References: 15 | Cited by: 12

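One-Sentence Summary

This paper proposes a scalable active learning method that uses ensembles of hundreds of models to identify and subsample the most informative training data from large datasets, acquiring 10k to 500k samples at a time. By reusing intermediate training checkpoints, the method builds these ensembles at minimal computational cost and selects high-quality subsets that improve model accuracy and reduce training time, with gains reported on CIFAR-10, CIFAR-100, ImageNet, and a production-scale object detection benchmark.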

ABSTRACT

Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size of such datasets, it is conceivable that they contain certain samples that either do not contribute or negatively impact the DNN's optimization. Modifying the training distribution in a way that excludes such samples could provide an effective solution to both improve performance and reduce training time. In this paper, we propose to scale up ensemble Active Learning (AL) methods to perform acquisition at a large scale (10k to 500k samples at a time). We do this with ensembles of hundreds of models, obtained at a minimal computational cost by reusing intermediate training checkpoints. This allows us to automatically and efficiently perform a training data subset search for large labeled datasets. We observe that our approach obtains favorable subsets of training data, which can be used to train more accurate DNNs than training with the entire dataset. We perform an extensive experimental study of this phenomenon on three image classification benchmarks (CIFAR-10, CIFAR-100 and ImageNet), as well as an internal object detection benchmark for prototyping perception models for autonomous driving. Unlike existing studies, our experiments on object detection are at the scale required for production-ready autonomous driving systems. We provide insights on the impact of different initialization schemes, acquisition functions and ensemble configurations at this scale. Our results provide strong empirical evidence that optimizing the training data distribution can provide significant benefits on large scale vision tasks.

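Motivation and Objectives

Investigate whether optimizing the training data distribution through data subsampling can improve DNN performance and reduce training time.
Scale active learning methods to acquisitions of 10k to 500k samples at a time, the scale typical of real-world vision applications.
Examine how different initialization schemes, acquisition functions, and ensemble configurations affect data subset selection at this scale.
Evaluate the proposed method on production-relevant benchmarks, including a large-scale object detection dataset for autonomous driving.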

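Proposed Method

Build ensembles of hundreds of DNNs from intermediate training checkpoints, so that large-scale active learning acquisition stays efficient.
Use the ensemble's predictions to estimate each sample's uncertainty and informativeness, and select the most valuable subset of training data.
Apply acquisition functions such as uncertainty sampling and query-by-committee across the ensemble to identify informative samples at scale.
Reuse model checkpoints saved during ongoing training runs to minimize computational overhead and allow fast iteration over large data subsets.
Subsample the data by selecting the top-k most informative samples according to ensemble disagreement or uncertainty scores.
Train the final model on the selected subset and compare it with a model trained on the full dataset across several benchmarks.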
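
The scoring and top-k selection steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: this review does not say which acquisition functions the paper uses, so predictive entropy (an uncertainty score) and mutual information (a committee-disagreement score) serve as stand-ins. The function names and the input layout, an array of softmax outputs shaped (members, samples, classes), are hypothetical.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); an illustrative choice, not taken from the paper


def predictive_entropy(probs):
    """Uncertainty score: entropy of the ensemble-averaged prediction.

    probs: softmax outputs with shape (n_members, n_samples, n_classes).
    Returns one score per sample, shape (n_samples,).
    """
    mean = probs.mean(axis=0)
    return -(mean * np.log(mean + EPS)).sum(axis=1)


def mutual_information(probs):
    """Disagreement score: entropy of the mean minus the mean member entropy."""
    member_entropy = -(probs * np.log(probs + EPS)).sum(axis=2)
    return predictive_entropy(probs) - member_entropy.mean(axis=0)


def select_top_k(scores, k):
    """Indices of the k highest-scoring samples, highest score first."""
    order = np.argsort(-scores, kind="stable")
    return order[:k]


# Two ensemble members, three samples, two classes.
probs = np.array([
    [[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]],  # member A
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],  # member B
])
subset = select_top_k(mutual_information(probs), k=1)
```

In this toy input, sample 0 is confidently agreed upon, sample 1 is uncertain for both members, and sample 2 is the only one where the members disagree. Entropy ranks samples 1 and 2 equally, while mutual information singles out sample 2, which shows why the choice of acquisition function changes the selected subset.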

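Experimental Results

Research Questions

RQ1: At large scale (acquiring 10k to 500k samples at a time), can active learning yield higher model accuracy than training on the full dataset?
RQ2: How do different initialization schemes affect the performance of ensemble-based active learning at large scale?
RQ3: In the large-scale setting, how do different acquisition functions affect the quality of the selected training subset?
RQ4: How do ensemble configurations (such as the number of models and the training schedule) affect the effectiveness of data subsampling?
RQ5: Can data subsampling with this method improve performance on production-scale vision tasks such as object detection for autonomous driving?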

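Key Findings

By selecting only the most informative samples, the proposed method achieved higher accuracy than full-dataset training on CIFAR-10, CIFAR-100, and ImageNet.
Subsets selected by ensemble-based active learning reduced training time on all benchmarks while maintaining or improving model performance.
The method performed well on a large internal object detection benchmark for autonomous driving, supporting its applicability to production systems.
Gains varied across acquisition functions, with uncertainty-based methods showing consistent improvements on all datasets.
Ensemble configurations with greater model diversity and a suitable initialization scheme selected data subsets more effectively.
Reusing intermediate training checkpoints made ensemble-based active learning scalable and computationally efficient, which makes data optimization at this scale practical.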
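
The checkpoint-reuse idea in the last finding can be illustrated with a toy sketch. It assumes a linear softmax classifier trained by full-batch gradient descent as a stand-in for a DNN, and all names and hyperparameters (`train_with_checkpoints`, `save_every`, the learning rate) are hypothetical. The point is only that snapshots saved during a single training run can act as ensemble members at no extra training cost.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def train_with_checkpoints(x, y, n_classes, epochs=20, save_every=5, lr=0.5, seed=0):
    """Train a linear softmax classifier and keep a weight snapshot every few epochs."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(scale=0.01, size=(x.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    checkpoints = []
    for epoch in range(1, epochs + 1):
        # Gradient of the mean cross-entropy loss with respect to the weights.
        grad = x.T @ (softmax(x @ weights) - onehot) / len(x)
        weights -= lr * grad
        if epoch % save_every == 0:
            checkpoints.append(weights.copy())  # copy, since weights keep changing
    return checkpoints


def ensemble_probs(checkpoints, x):
    """Stack per-checkpoint predictions into shape (n_members, n_samples, n_classes)."""
    return np.stack([softmax(x @ w) for w in checkpoints])


# Synthetic data: 50 samples, 4 features, 3 classes.
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 4))
y = rng.integers(0, 3, size=50)
checkpoints = train_with_checkpoints(x, y, n_classes=3)
probs = ensemble_probs(checkpoints, x)
```

With the defaults above, 20 epochs and a snapshot every 5 epochs give four ensemble members from one run. Snapshots taken close together tend to make similar predictions, so how checkpoints are spaced and combined is one of the ensemble configuration choices the paper studies.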


This review was generated by AI and reviewed by human editors.