QUICK REVIEW

[Paper Review] Diverse mini-batch Active Learning

Fedor Zhdanov|arXiv (Cornell University)|Jan 17, 2019

Machine Learning and Algorithms|References: 15|Cited by: 92

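One-Sentence Summary

The paper proposes a scalable mini-batch active learning method that uses weighted K-means clustering to balance informativeness and diversity when selecting samples for labeling.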

ABSTRACT

We study the problem of reducing the amount of labeled training data required to train supervised classification models. We approach it by leveraging Active Learning, through sequential selection of examples which benefit the model most. Selecting examples one by one is not practical for the amount of training examples required by the modern Deep Learning models. We consider the mini-batch Active Learning setting, where several examples are selected at once. We present an approach which takes into account both informativeness of the examples for the model, as well as the diversity of the examples in a mini-batch. By using the well studied K-means clustering algorithm, this approach scales better than the previously proposed approaches, and achieves comparable or better performance.

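Research Motivation and Objectives

Reduce the amount of labeled data needed to train supervised models.
Select examples in mini-batches, since retraining a deep model after every single label is impractical.
Account for both informativeness and diversity when selecting a batch.
Provide a scalable solution based on K-means clustering.
Demonstrate effectiveness across different models on text and image datasets.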

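Proposed Method

Formulate batch selection as a facility location problem to promote diversity.
Approximate that problem with K-means clustering, which scales better than submodular optimization approaches.
Incorporate informativeness by weighting each sample by its informativeness score in the weighted K-means objective.
Use margin-based uncertainty as the informativeness measure.
Pre-filter the unlabeled samples before clustering, keeping only the most informative candidates, to improve efficiency.
For each batch, select for labeling the k samples closest to the k cluster centers.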
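
The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the margin formula (one minus the gap between the two highest class probabilities), the default beta value, and the use of scikit-learn's KMeans with sample_weight are all choices made here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def margin_informativeness(probs):
    """Assumed margin score: 1 - (top1 - top2); higher means more uncertain."""
    ordered = np.sort(probs, axis=1)
    return 1.0 - (ordered[:, -1] - ordered[:, -2])


def select_diverse_batch(features, probs, k, beta=10, seed=0):
    """Return indices of k unlabeled samples picked via weighted K-means.

    features: (n, d) array of sample representations (hypothetical input).
    probs:    (n, c) array of predicted class probabilities.
    beta:     pre-filter factor; beta * k candidates are kept (assumed default).
    """
    scores = margin_informativeness(probs)

    # Pre-filter: keep only the beta * k most informative candidates.
    n_keep = min(len(scores), beta * k)
    candidates = np.argsort(-scores)[:n_keep]
    x, w = features[candidates], scores[candidates]

    # Weighted K-means: each sample is weighted by its informativeness.
    # A tiny constant keeps every weight strictly positive.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km.fit(x, sample_weight=w + 1e-12)

    # For each cluster center, take the closest candidate not yet taken.
    taken = np.zeros(len(x), dtype=bool)
    chosen = []
    for center in km.cluster_centers_:
        dist = np.linalg.norm(x - center, axis=1)
        dist[taken] = np.inf
        idx = int(np.argmin(dist))
        taken[idx] = True
        chosen.append(idx)
    return candidates[np.array(chosen)]
```

Calling select_diverse_batch(features, probs, k=5, beta=4) on 200 unlabeled samples would cluster only the 20 most uncertain ones and return five distinct indices into the original pool. Masking already-chosen candidates is a small safeguard added here so that two centers never map to the same sample.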

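Experimental Results

Research Questions

RQ1: Does combining diversity with informativeness in mini-batch selection improve learning efficiency over an uncertainty-only baseline?
RQ2: Can K-means clustering provide a scalable approximation to informativeness-aware diverse batch selection?
RQ3: How does the proposed method perform across model architectures on text and image datasets?
RQ4: How does the pre-filtering parameter beta affect performance and scalability?
RQ5: In this setting, is margin-based uncertainty more effective than entropy-based or other uncertainty measures?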
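
To make the distinction in RQ5 concrete, the following sketch computes a margin-based and an entropy-based uncertainty score for two hand-made predictions. The exact formulas are assumptions chosen for illustration (this review does not state the ones used in the paper), and the probability vectors are made up.

```python
import numpy as np


def margin_uncertainty(probs):
    """Assumed margin score: 1 - (top1 - top2); higher means more uncertain."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return 1.0 - (top2[:, 1] - top2[:, 0])


def entropy_uncertainty(probs, eps=1e-12):
    """Shannon entropy of the predicted distribution (natural log)."""
    return -np.sum(probs * np.log(probs + eps), axis=1)


# Illustrative predictions over three classes (not from the paper).
probs = np.array([
    [0.50, 0.49, 0.01],  # two classes nearly tied
    [0.40, 0.30, 0.30],  # probability spread over all three classes
])

print("margin :", margin_uncertainty(probs))
print("entropy:", entropy_uncertainty(probs))
```

With these inputs the two measures disagree on the ranking: the margin score treats the first prediction as more uncertain because its top two classes are almost tied, while entropy ranks the second higher because its probability mass is spread more evenly.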

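Key Findings

Diversity-aware mini-batch selection generally outperforms uncertainty sampling on multiple datasets.
The clustering approach is significantly faster than submodular optimization while reaching comparable or better accuracy.
Weighting the clustering by informativeness scores improves performance on several datasets.
Selecting the first batch by clustering can improve early accuracy on some datasets.
On CIFAR-10, diversity-based methods slightly outperform pure uncertainty sampling, and weighted clustering usually performs best.
Overall, the method is scalable, competitive with more complex techniques, and simpler to implement.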

This review was generated by AI and reviewed by a human editor.