QUICK REVIEW

[Paper Review] Computational Baby Learning

Xiaodan Liang, Si Liu|arXiv (Cornell University)|Nov 11, 2014

Advanced Image and Video Retrieval Techniques|References: 36|Cited by: 25

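One-Sentence Summary

This paper proposes a computational baby learning framework for slightly-supervised object detection. It draws on prior knowledge from a CNN pre-trained on ImageNet, builds an initial detector by exemplar learning from a few positive samples, and refines it iteratively by tracking diverse instances in unlabeled videos. Using only two annotated samples per class and about 20,000 unlabeled videos, the method reports state-of-the-art performance on PASCAL VOC 07/10/12, outperforming fully supervised baselines.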

ABSTRACT

Intuitive observations show that a baby may inherently possess the capability of recognizing a new visual concept (e.g., chair, dog) by learning from only very few positive instances taught by parent(s) or others, and this recognition capability can be gradually further improved by exploring and/or interacting with the real instances in the physical world. Inspired by these observations, we propose a computational model for slightly-supervised object detection, based on prior knowledge modelling, exemplar learning and learning with video contexts. The prior knowledge is modeled with a pre-trained Convolutional Neural Network (CNN). When very few instances of a new concept are given, an initial concept detector is built by exemplar learning over the deep features from the pre-trained CNN. Simulating the baby's interaction with physical world, the well-designed tracking solution is then used to discover more diverse instances from the massive online unlabeled videos. Once a positive instance is detected/identified with high score in each video, more variable instances possibly from different view-angles and/or different distances are tracked and accumulated. Then the concept detector can be fine-tuned based on these new instances. This process can be repeated again and again till we obtain a very mature concept detector. Extensive experiments on Pascal VOC-07/10/12 object detection datasets well demonstrate the effectiveness of our framework. It can beat the state-of-the-art full-training based performances by learning from very few samples for each object category, along with about 20,000 unlabeled videos.

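Research Motivation and Goals

Develop a computational model, inspired by how babies learn, that performs object detection with very little human-annotated data.
Reduce the high annotation cost of deep learning for object detection by exploiting large-scale unlabeled video data.
Improve detection performance step by step through iterative learning from diverse, real-world video instances.
Show that a mature concept detector can be built from only two initial positive samples, combined with video-based instance mining and model fine-tuning.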

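Proposed Method

Model prior knowledge with a CNN pre-trained on ImageNet, then apply domain-adaptive fine-tuning on previously learned object categories.
Build the initial concept detector by exemplar learning: using deep features from an intermediate CNN layer, train a separate linear classifier for each given positive sample.
Detect positive instances with high confidence in unlabeled videos and use them as seeds for region-based video tracking, accumulating diverse, variable instances from different view angles and distances.
Progressively refine the concept detector with the newly tracked instances, and further fine-tune the pre-trained CNN as positive samples accumulate.
The framework iteratively mines new instances from online video streams and integrates them, so the detector improves continuously.
The method uses video context to maintain appearance consistency and spatial correspondence during tracking, which makes detection more robust.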
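The loop described above (exemplar classifiers, confident detection in a video, tracking, retraining) can be sketched in a few lines. This is only an illustrative toy under stated assumptions, not the authors' implementation: the 2-D tuples stand in for deep CNN features, each "video" is assumed to be an already-tracked sequence of one object, and the centroid-difference classifier is a simple substitute for whatever linear classifier the paper trains per exemplar. The names `train_exemplar`, `mine_video`, and `baby_learning` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_exemplar(exemplar, negatives):
    """Train one linear classifier (w, b) for a single positive exemplar.

    Stand-in classifier (an assumption): w points from the negative centroid
    to the exemplar, and b places the boundary halfway between the two.
    """
    centroid = [sum(col) / len(col) for col in zip(*negatives)]
    w = [e - c for e, c in zip(exemplar, centroid)]
    midpoint = [(e + c) / 2 for e, c in zip(exemplar, centroid)]
    return w, -dot(w, midpoint)

def score(detector, x):
    """Detector score is the best response over all exemplar classifiers."""
    return max(dot(w, x) + b for w, b in detector)

def mine_video(detector, video, threshold):
    """Return every instance in the track if its best frame scores above threshold.

    Real region-based tracking is replaced by the assumption that all
    feature vectors in `video` belong to the same tracked object.
    """
    best = max(score(detector, x) for x in video)
    return list(video) if best > threshold else []

def baby_learning(seeds, negatives, videos, rounds=3, threshold=0.0):
    """Iteratively grow the positive set from unlabeled videos and retrain."""
    positives = list(seeds)
    detector = [train_exemplar(p, negatives) for p in positives]
    for _ in range(rounds):
        mined = []
        for video in videos:
            for inst in mine_video(detector, video, threshold):
                if inst not in positives and inst not in mined:
                    mined.append(inst)
        if not mined:  # nothing new was discovered, so stop early
            break
        positives.extend(mined)
        detector = [train_exemplar(p, negatives) for p in positives]
    return detector, positives

# Toy data: two labeled seeds, a few background vectors, three unlabeled tracks.
seeds = [(4.0, 0.0), (4.0, 1.0)]
negatives = [(-1.0, 0.0), (0.0, -1.0), (1.0, 0.0), (0.0, 1.0)]
videos = [
    [(3.5, 0.5), (2.5, 2.0), (1.0, 4.0)],  # starts near the seeds, then drifts
    [(-1.0, -1.0), (-0.5, 0.2)],           # background only
    [(0.5, 4.5), (0.0, 5.0)],              # same concept seen from a new "view"
]
detector, positives = baby_learning(seeds, negatives, videos)
print(len(positives), "positives after mining")
```

In this toy setup the third track is not recognized by the two seed classifiers alone. It only becomes detectable after the first track contributes instances that bridge the gap, which is the intuition behind accumulating instances from different view angles and distances.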

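Experimental Results

Research Questions

RQ1: Can a high-accuracy object detector be trained with only two annotated positive samples per class?
RQ2: How effective is video-based tracking at discovering diverse, variable instances for concept refinement?
RQ3: Does integrating unlabeled video data significantly improve detection performance under minimal supervision?
RQ4: To what extent does iterative refinement with tracked instances outperform fully supervised training baselines?
RQ5: Does fine-tuning the pre-trained CNN on the mined data further improve detector performance?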

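Key Findings

Using only two positive samples per class and about 20,000 unlabeled videos, the proposed framework achieves 68.9% mAP on PASCAL VOC 2007, outperforming the fully supervised R-CNN baseline.
With only two initial seeds and video-based mining, the method achieves 65.3% mAP on VOC 2007 using a CNN re-trained with VOC-related categories excluded, on par with the fully trained R-CNN_NIN_BB (65.4% mAP).
When applied to a full R-CNN model trained on all VOC 2007 images, the framework raises mAP on VOC 2007 by 3.5 percentage points (to 62.0%).
On VOC 2012, the method achieves 68.9% mAP with the Network-in-Network architecture, outperforming the fully supervised R-CNN_NIN_BB (65.4% mAP).
The method is robust to seed selection: over ten trials with randomly chosen seeds for the aeroplane category, the mean mAP is 68.5%, only slightly below the 68.9% obtained with the default seed selection.
Visualizations confirm that the framework successfully tracks diverse instances across different view angles, occlusions, and appearance changes, supporting the value of video context for mining diverse data.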


This review was generated by AI and reviewed by a human editor.