QUICK REVIEW

[Paper Review] Deep Bayesian Active Learning with Image Data

Yarin Gal, Riashat Islam|arXiv (Cornell University)|Mar 8, 2017

Machine Learning and Algorithms|References: 29|Cited by: 577

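One-sentence summary

This paper proposes an active learning framework for high-dimensional image data that uses Bayesian convolutional neural networks with MC dropout uncertainty estimates, reaching higher labelling efficiency than baseline methods on MNIST and a melanoma diagnosis task while remaining competitive with semi-supervised methods.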

ABSTRACT

Even though active learning forms an important pillar of machine learning, deep learning tools are not prevalent within it. Deep learning poses several difficulties when used in an active learning setting. First, active learning (AL) methods generally rely on being able to learn and update models from small amounts of data. Recent advances in deep learning, on the other hand, are notorious for their dependence on large amounts of data. Second, many AL acquisition functions rely on model uncertainty, yet deep learning methods rarely represent such model uncertainty. In this paper we combine recent advances in Bayesian deep learning into the active learning framework in a practical way. We develop an active learning framework for high dimensional data, a task which has been extremely challenging so far, with very sparse existing literature. Taking advantage of specialised models such as Bayesian convolutional neural networks, we demonstrate our active learning techniques with image data, obtaining a significant improvement on existing active learning approaches. We demonstrate this on both the MNIST dataset, as well as for skin cancer diagnosis from lesion images (ISIC2016 task).

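Research motivation and goals

Motivate active learning for high-dimensional image data as a way to reduce labelling cost.
Develop a Bayesian CNN framework that represents predictive uncertainty.
Evaluate acquisition functions (BALD, BALD variants, max entropy, variation ratios, and others) in the image setting.
Compare against kernel-based AL, a deterministic CNN baseline, and semi-supervised methods.
Demonstrate practical applicability on melanoma classification with the ISIC 2016 data.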

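Proposed method

Model weight uncertainty with a Bayesian convolutional neural network, using dropout as a variational approximation.
Train with dropout applied before every weight layer, and keep dropout active at test time (MC dropout) to sample from the approximate posterior.
Define the acquisition functions (BALD, max entropy, variation ratios, mean STD, random) and approximate them with MC samples from the approximate posterior, so that every candidate point in the pool receives an acquisition score.
Derive computationally tractable estimators for BALD and the related acquisition functions via MC integration over the approximate posterior.
Evaluate the acquisition strategies on MNIST, starting from a small initial labelled set and an unlabelled pool.
Fine-tune a pretrained VGG16 CNN on the ISIC 2016 melanoma data to evaluate active learning on a real-world medical task.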
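The MC estimators described in the list above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the T stochastic forward passes with dropout switched on have already been run and stacked into an array `mc_probs` of shape (T, N, C), and the names `entropy`, `acquisition_scores`, and `select_top_k` are hypothetical helpers introduced here. Variation ratios is taken as one minus the largest mean class probability, and mean STD as the per-class standard deviation across passes, averaged over classes.

```python
import numpy as np


def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats along `axis`; `eps` guards against log(0)."""
    return -np.sum(p * np.log(p + eps), axis=axis)


def acquisition_scores(mc_probs):
    """Score pool points from MC dropout samples (illustrative helper).

    mc_probs: shape (T, N, C), softmax outputs of T stochastic forward
    passes over N pool points with C classes.
    Returns a dict of arrays of shape (N,); higher means "label this first".
    """
    mc_probs = np.asarray(mc_probs, dtype=float)
    mean_probs = mc_probs.mean(axis=0)                  # (N, C) MC predictive distribution
    predictive_entropy = entropy(mean_probs)            # entropy of the averaged prediction
    expected_entropy = entropy(mc_probs).mean(axis=0)   # average entropy of each pass
    return {
        "max_entropy": predictive_entropy,
        # BALD: mutual information between the prediction and the weights.
        "bald": predictive_entropy - expected_entropy,
        "var_ratios": 1.0 - mean_probs.max(axis=-1),
        "mean_std": mc_probs.std(axis=0).mean(axis=-1),
    }


def select_top_k(scores, k):
    """Indices of the k highest-scoring pool points (ties keep pool order)."""
    return np.argsort(-np.asarray(scores), kind="stable")[:k]


# Toy input: T=2 passes, N=3 pool points, C=2 classes.
# Point 0: every pass says 50/50 (noisy data, passes agree).
# Point 1: passes are confident but disagree (model uncertainty).
# Point 2: every pass is confident and agrees.
toy = np.array([
    [[0.5, 0.5], [1.0, 0.0], [1.0, 0.0]],
    [[0.5, 0.5], [0.0, 1.0], [1.0, 0.0]],
])
scores = acquisition_scores(toy)
picked = select_top_k(scores["bald"], k=1)
print(picked, scores["bald"])
```

In this toy input, max entropy scores points 0 and 1 identically, while BALD is near zero for point 0 and close to ln 2 for point 1, because only point 1 shows disagreement between passes. In an active learning loop, the selected indices would be labelled, moved from the pool into the training set, and the model retrained before the next acquisition round.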

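Experimental results

Research questions

RQ1: Can Bayesian CNNs with MC dropout provide reliable uncertainty estimates for active learning on high-dimensional image data?
RQ2: With deep models, which acquisition functions (BALD, max entropy, variation ratios, and others) give the best labelling efficiency?
RQ3: On image classification tasks, how do the active learning strategies compare with kernel-based AL and semi-supervised methods?
RQ4: With limited labels, what roles do model uncertainty and data (aleatoric) uncertainty play in melanoma lesion classification?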

Key findings

Table 1: MNIST acquisition steps needed to reach a given test error (%), by method
10%	145	120	165	230	255
5%	335	295	355	695	835

Table 2: MNIST test error (%) with 1000 labelled samples
Method	Test error
Semi-supervised:
Semi-sup. Embedding (Weston et al. 2012)	5.73%
Transductive SVM (Weston et al. 2012)	5.38%
MTC (Rifai et al. 2011)	3.64%
Pseudo-label (Lee, 2013)	3.46%
AtlasRBF (Pitelis et al. 2014)	3.68%
DGN (Kingma et al. 2014)	2.40%
Virtual Adversarial (Miyato et al. 2015)	1.32%
Ladder Network (Γ-model) (Rasmus et al. 2015)	1.53%
Ladder Network (full) (Rasmus et al. 2015)	0.84%
Active learning, by acquisition function:
Random	4.66%
BALD	1.80%
Max Entropy	1.74%
Var Ratios	1.64%
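
To put the active-learning rows of Table 2 in proportion, the short calculation below expresses each uncertainty-based acquisition function's test error as a relative reduction against random acquisition. The error values are copied from Table 2; the helper name `relative_reduction` is illustrative only.

```python
# Test error (%) at 1000 labelled MNIST samples, copied from Table 2.
errors = {"Random": 4.66, "BALD": 1.80, "Max Entropy": 1.74, "Var Ratios": 1.64}


def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


reductions = {
    name: relative_reduction(errors["Random"], err)
    for name, err in errors.items()
    if name != "Random"
}
for name, r in reductions.items():
    print(f"{name}: {100 * r:.1f}% lower test error than Random")
```

All three uncertainty-based acquisition functions cut the test error by roughly 61% to 65% relative to random acquisition at the same labelling budget.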

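On MNIST, the BALD, variation ratios, and max entropy acquisition functions are more data-efficient than random acquisition or mean STD: they reach a target error with fewer labelled examples.
On MNIST, variation ratios can converge faster than the others, while mean STD often trails the better-performing acquisition functions.
With uncertainty-based acquisition, the Bayesian CNN with MC dropout outperforms a deterministic CNN both early in learning and in final accuracy.
Compared with kernel-based AL (MBR) on MNIST, CNN-based AL with BALD or max entropy acquisition reaches higher accuracy with fewer labelled examples.
On the ISIC melanoma data, active learning with BALD improves AUC faster than uniform sampling and tends to acquire more of the informative positive examples.
In the small-data regime, active learning performance is sensitive to the data split, which highlights the variability of medical imaging datasets.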

This review was generated by AI and checked by a human editor.