QUICK REVIEW

[Paper Review] A Multi-scale Multiple Instance Video Description Network

Huijuan Xu, Subhashini Venugopalan|arXiv (Cornell University)|May 21, 2015

Multimodal Machine Learning Applications | References: 8 | Cited by: 46

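One-Sentence Summary

This paper proposes the Multi-scale Multiple Instance Video Description Network (MM-VDN), an end-to-end trainable architecture that combines fully convolutional networks (FCNs) with multiple instance learning (MIL) to detect and localize objects at different scales and positions in video frames. By pairing multi-scale FCN features with a sequence-to-sequence LSTM, MM-VDN generates more accurate and detailed video descriptions than a single-scale CNN baseline and achieves state-of-the-art performance on the YouTube video description benchmark.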

ABSTRACT

Generating natural language descriptions for in-the-wild videos is a challenging task. Most state-of-the-art methods for solving this problem borrow existing deep convolutional neural network (CNN) architectures (AlexNet, GoogLeNet) to extract a visual representation of the input video. However, these deep CNN architectures are designed for single-label centered-positioned object classification. While they generate strong semantic features, they have no inherent structure allowing them to detect multiple objects of different sizes and locations in the frame. Our paper tries to solve this problem by integrating the base CNN into several fully convolutional neural networks (FCNs) to form a multi-scale network that handles multiple receptive field sizes in the original image. FCNs, previously applied to image segmentation, can generate class heat-maps efficiently compared to sliding window mechanisms, and can easily handle multiple scales. To further handle the ambiguity over multiple objects and locations, we incorporate the Multiple Instance Learning mechanism (MIL) to consider objects in different positions and at different scales simultaneously. We integrate our multi-scale multi-instance architecture with a sequence-to-sequence recurrent neural network to generate sentence descriptions based on the visual representation. Ours is the first end-to-end trainable architecture that is capable of multi-scale region processing. Evaluation on a Youtube video dataset shows the advantage of our approach compared to the original single-scale whole frame CNN model. Our flexible and efficient architecture can potentially be extended to support other video processing tasks.

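Motivation and Objectives

Address the limitations of single-scale, whole-frame CNNs in detecting small or multiple objects in complex video frames.
Enable end-to-end training for video description without relying on bounding boxes or instance-level annotations, while handling uncertainty in object scale, location, and count.
Improve video captioning performance by integrating spatially localized, multi-scale visual representations.
Train with weak, sentence-level supervision, with no need for bounding-box or instance-level annotations.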

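Proposed Method

Convert a pretrained AlexNet into a fully convolutional network (FCN) that produces class score maps at multiple input scales.
Use several FCNs with different input resolutions to capture features at different receptive field sizes, enabling detection of both small and large objects.
Apply a multiple instance learning (MIL) mechanism at each scale to select the most relevant regions and scales, using weak supervision from sentence captions.
Feed the MIL-processed multi-scale features into a sequence-to-sequence LSTM decoder to generate natural language descriptions.
Train the whole network end to end with a cross-entropy loss against the ground-truth sentence annotations.
Initialize the CNN components with ImageNet-pretrained weights to improve feature quality and speed up convergence.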
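
The MIL step above can be illustrated with a small sketch. This summary does not state which aggregation operator the paper uses, so the code assumes max pooling, a common MIL choice: every location at every scale is an "instance", and the frame takes the score of its best instance per class. The function name `mil_max_pool`, the variables `coarse` and `fine`, and all scores are hypothetical and exist only for illustration.

```python
from typing import Dict, List

Grid = List[List[float]]      # H x W map of scores for one class
HeatMaps = Dict[str, Grid]    # class name -> score map at one scale


def mil_max_pool(maps_per_scale: List[HeatMaps]) -> Dict[str, float]:
    """Collapse per-scale heat maps into one frame-level score per class.

    Assumption: max pooling is the MIL aggregation. Each location at each
    scale is an instance; the frame (the bag) keeps its best instance.
    """
    pooled: Dict[str, float] = {}
    for heat_maps in maps_per_scale:
        for cls, grid in heat_maps.items():
            best_here = max(max(row) for row in grid)
            pooled[cls] = max(pooled.get(cls, float("-inf")), best_here)
    return pooled


# Toy, made-up scores for one frame at two input scales.
coarse = {"carrot": [[0.1]], "man": [[0.9]]}        # whole-frame view (1x1)
fine = {"carrot": [[0.2, 0.8], [0.1, 0.3]],         # finer view (2x2 map)
        "man": [[0.4, 0.5], [0.6, 0.3]]}

frame_scores = mil_max_pool([coarse, fine])
print(frame_scores)  # {'carrot': 0.8, 'man': 0.9}
```

In this toy frame the small object ("carrot") only scores highly in one cell of the finer map, while the large object ("man") is already captured by the whole-frame view; pooling over both scales keeps both signals without any region-level labels.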

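Experimental Results

Research Questions

RQ1: Does multi-scale feature extraction significantly improve video description quality compared with a single-scale, whole-frame CNN?
RQ2: How effective is multiple instance learning (MIL) at localizing relevant visual concepts without instance-level annotations?
RQ3: To what extent does combining multi-scale FCN features with an end-to-end trainable architecture improve caption generation performance?
RQ4: How do different input scales and training strategies affect the model's ability to detect and describe small or distant objects in videos?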

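Key Findings

MM-VDN significantly outperforms the single-scale CNN baseline, as well as existing models such as LSTM-YT and FGM, at generating accurate video descriptions.
The model generates more detailed and contextually appropriate descriptions, for example correctly producing "a man is cutting a carrot" rather than "a man is cutting a tomato".
Heat maps produced by the FCN component clearly localize small objects such as carrots and guitars, demonstrating the effectiveness of multi-scale detection.
Combining multi-scale features with MIL yields complementary improvements: as the histogram shows, different scales contribute distinct high-scoring features.
In 70% of test cases, MM-VDN produced partially or fully correct descriptions, with notable gains in detecting actions and objects that are not visible in whole-frame features.
Compared with the baseline, MM-VDN makes fewer hallucination errors (for example, it does not misdescribe "a turtle is walking" as "a panda is walking"), indicating better alignment with the ground truth.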
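
One way to read the per-scale histogram mentioned above is as a count of which scale supplies the top score for each concept. The sketch below shows that bookkeeping on made-up numbers; it does not reproduce the paper's analysis or its data, and the names `winning_scales`, `whole_frame`, and `fine_scale` are hypothetical.

```python
from collections import Counter
from typing import Dict


def winning_scales(scores_per_scale: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each concept, return the scale whose pooled score is highest."""
    concepts = set()
    for scores in scores_per_scale.values():
        concepts.update(scores)
    winners: Dict[str, str] = {}
    for concept in sorted(concepts):
        winners[concept] = max(
            scores_per_scale,
            key=lambda scale: scores_per_scale[scale].get(concept, float("-inf")),
        )
    return winners


# Toy pooled scores (illustrative only, not results from the paper).
pooled = {
    "whole_frame": {"man": 0.9, "carrot": 0.1, "guitar": 0.2},
    "fine_scale": {"man": 0.6, "carrot": 0.8, "guitar": 0.7},
}

winners = winning_scales(pooled)
histogram = Counter(winners.values())
print(winners)    # {'carrot': 'fine_scale', 'guitar': 'fine_scale', 'man': 'whole_frame'}
print(histogram)  # Counter({'fine_scale': 2, 'whole_frame': 1})
```

If every concept were won by the same scale, the extra scales would add little; a spread across scales, as in this toy case, is what "complementary" means here.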

This review was generated by AI and reviewed by a human editor.