QUICK REVIEW

[Paper Review] Active Learning for Crowd-Sourced Databases

Barzan Mozafari, Purnamrita Sarkar|arXiv (Cornell University)|Sep 17, 2012

Machine Learning and Algorithms|References: 63|Cited by: 32

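One-Sentence Summary

This paper proposes two novel active learning algorithms for crowd-sourced databases, Uncertainty and MinExpError, which treat the classifier as a black box, estimate uncertainty with the non-parametric bootstrap, and support batching and parallelism; they cut labeling queries by one to two orders of magnitude relative to the baseline and, on the real-world and UCI datasets, need 4.5 to 44 times fewer queries than existing active learning methods.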

ABSTRACT

Crowd-sourcing has become a popular means of acquiring labeled data for a wide variety of tasks where humans are more accurate than computers, e.g., labeling images, matching objects, or analyzing sentiment. However, relying solely on the crowd is often impractical even for data sets with thousands of items, due to time and cost constraints of acquiring human input (which cost pennies and minutes per label). In this paper, we propose algorithms for integrating machine learning into crowd-sourced databases, with the goal of allowing crowd-sourcing applications to scale, i.e., to handle larger datasets at lower costs. The key observation is that, in many of the above tasks, humans and machine learning algorithms can be complementary, as humans are often more accurate but slow and expensive, while algorithms are usually less accurate, but faster and cheaper. Based on this observation, we present two new active learning algorithms to combine humans and algorithms together in a crowd-sourced database. Our algorithms are based on the theory of non-parametric bootstrap, which makes our results applicable to a broad class of machine learning models. Our results, on three real-life datasets collected with Amazon's Mechanical Turk, and on 15 well-known UCI data sets, show that our methods on average ask humans to label one to two orders of magnitude fewer items to achieve the same accuracy as a baseline that labels random images, and two to eight times fewer questions than previous active learning schemes.

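Research Motivation and Goals

Enable crowd-sourced databases to scale to large datasets by minimizing the cost of human labeling.
Design active learning algorithms that are general and scalable, and that non-expert users can apply without modifying the classifier's internals.
Support batching and parallelism in active learning when deployed in real-world crowd-sourcing systems.
Handle label noise from unreliable crowd workers without assuming uniform worker quality.
Substantially reduce the number of items that humans must label while maintaining high model accuracy.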

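Proposed Method

Use the non-parametric bootstrap to generate multiple classifier predictions on the unlabeled data, which yields uncertainty estimates without modifying the classifier.
The Uncertainty algorithm selects, as the most informative items to label, the instances whose predictions have the highest variance across bootstrap samples.
The MinExpError algorithm aims to minimize expected error by selecting instances on which the model is most uncertain and whose labels have the greatest potential to reduce error.
Support batching and parallelism by processing multiple instances at once, improving runtime efficiency in crowd-sourcing workflows.
Treat the classifier as a black box, with no access to its internal parameters and no changes to its training procedure.
Integrate query selection in both the initial (upfront) setting and the iterative setting; in the latter, the model is retrained after each batch to improve performance.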
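
The variance-based selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `train_centroid`, `bootstrap_uncertainty`, and `select_batch`, the parameter `k`, and the toy one-dimensional data are all hypothetical, and the nearest-class-mean learner merely stands in for an arbitrary black-box classifier. The sketch assumes binary labels, so the variance of the bootstrap predictions for an item is p(1 - p), where p is the fraction of resampled models that predict 1. It covers only the variance-based score, not MinExpError.

```python
import random
from typing import Callable, List, Sequence, Tuple

Item = float                      # hypothetical 1-D feature, for illustration only
Labeled = List[Tuple[Item, int]]  # (feature, binary label) pairs
Model = Callable[[Item], int]     # a trained classifier: feature -> 0 or 1


def train_centroid(data: Labeled) -> Model:
    """Toy black-box learner (nearest class mean); stands in for any classifier."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    if not pos or not neg:
        # The resample happened to contain a single class: predict it always.
        constant = 1 if pos else 0
        return lambda x: constant
    mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: 1 if abs(x - mu_pos) < abs(x - mu_neg) else 0


def bootstrap_uncertainty(train: Callable[[Labeled], Model], labeled: Labeled,
                          unlabeled: Sequence[Item], k: int = 50,
                          seed: int = 0) -> List[float]:
    """Score each unlabeled item by the variance of its k bootstrap predictions."""
    rng = random.Random(seed)
    votes = [0] * len(unlabeled)
    for _ in range(k):
        # Resample the labeled set with replacement and retrain from scratch;
        # the learner is only ever called, never inspected or modified.
        sample = [rng.choice(labeled) for _ in labeled]
        model = train(sample)
        for i, x in enumerate(unlabeled):
            votes[i] += model(x)
    # For 0/1 predictions the variance is p * (1 - p), p = fraction of 1-votes.
    return [(v / k) * (1 - v / k) for v in votes]


def select_batch(scores: Sequence[float], batch_size: int) -> List[int]:
    """Indices of the batch_size highest-scoring items, to be asked together."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:batch_size]


# Toy data (assumption): two well-separated classes on a line.
labeled = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
unlabeled = [0.5, 4.9, 5.1, 9.5]
scores = bootstrap_uncertainty(train_centroid, labeled, unlabeled)
batch = select_batch(scores, batch_size=2)
```

In an iterative setting, the items in `batch` would be sent to the crowd, their labels appended to `labeled`, and the scores recomputed. Because the `k` retraining runs are independent of one another, they could be distributed across workers without changing the result.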

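Experimental Results

Research Questions

RQ1: Can active learning be applied effectively to crowd-sourced databases while remaining general across diverse classification tasks?
RQ2: How can active learning methods be designed to work with an arbitrary classifier without modifying its internals?
RQ3: To what extent do batching and parallelism improve the efficiency of active learning in crowd-sourcing systems?
RQ4: How do the proposed methods handle, in practice, label noise from unreliable crowd workers?
RQ5: Can bootstrap-based uncertainty estimation outperform existing active learning strategies in query efficiency?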

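Key Findings

On three real-world Mechanical Turk datasets, the proposed algorithms reduce the number of labeling queries needed by one to two orders of magnitude compared with the baseline.
On 15 UCI datasets, the methods need 4.5 to 44 times fewer queries than existing active learning algorithms such as IWAL and Bootstrap-LV.
Uncertainty and MinExpError outperform domain-specific methods (such as MarginDistance, CrowdER, and CVHull) in both query efficiency and accuracy.
The iterative setting, with retraining, achieves higher model accuracy than the initial setting, which demonstrates the benefit of adaptive query selection.
Batching significantly improves runtime without harming label quality, making the approach scalable enough for production deployment.
The black-box, bootstrap-based approach generalizes well across diverse classification tasks and makes no assumptions about the underlying classifier or data distribution.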


This review was generated by AI and reviewed by a human editor.