QUICK REVIEW

[Paper Review] Jointly Localizing and Describing Events for Dense Video Captioning

Yehao Li, Ting Yao | arXiv (Cornell University) | Apr 23, 2018

Multimodal Machine Learning Applications | References: 38 | Cited by: 38

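ONE-SENTENCE SUMMARY

This paper proposes a unified, end-to-end deep learning framework for dense video captioning that jointly optimizes temporal event localization and sentence generation through a novel descriptiveness regression component. By feeding language-based feedback into the detection process and using an attribute-augmented captioning architecture, the method achieves state-of-the-art performance, reaching a METEOR score of 12.96% on the ActivityNet Captions test set.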

ABSTRACT

Automatically describing a video with natural language is regarded as a fundamental challenge in computer vision. The problem nevertheless is not trivial especially when a video contains multiple events to be worthy of mention, which often happens in real videos. A valid question is how to temporally localize and then describe events, which is known as "dense video captioning." In this paper, we present a novel framework for dense video captioning that unifies the localization of temporal event proposals and sentence generation of each proposal, by jointly training them in an end-to-end manner. To combine these two worlds, we integrate a new design, namely descriptiveness regression, into a single shot detection structure to infer the descriptive complexity of each detected proposal via sentence generation. This in turn adjusts the temporal locations of each event proposal. Our model differs from existing dense video captioning methods since we propose a joint and global optimization of detection and captioning, and the framework uniquely capitalizes on an attribute-augmented video captioning architecture. Extensive experiments are conducted on ActivityNet Captions dataset and our framework shows clear improvements when compared to the state-of-the-art techniques. More remarkably, we obtain a new record: METEOR of 12.96% on ActivityNet Captions official test set.

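MOTIVATION AND OBJECTIVES

Address the challenge of dense video captioning, where multiple events in a long video require both precise temporal localization and descriptive sentence generation.
Overcome the suboptimal performance of two-stage methods, in which localization and captioning are decoupled.
Investigate how language understanding can guide and refine temporal event proposals within a joint optimization framework.
Develop a unified architecture that models the interaction between event detection and sentence generation end to end.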

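PROPOSED METHOD

Introduce a descriptiveness regression component that estimates the descriptive complexity of each event proposal to guide temporal localization.
Integrate descriptiveness regression into a single-shot detection framework, trained jointly with event/background classification and temporal coordinate regression.
Use the descriptiveness scores as an attention mechanism that weights the clip-level features within each proposal, refining the proposal-level representation.
Adopt an attribute-augmented captioning architecture that generates natural-language descriptions from the refined, attention-weighted proposal features.
Use multi-scale anchor layers (conv3 to conv11) with progressively lower temporal resolution to improve localization accuracy for events of different durations.
Train the whole model end to end, enabling global optimization of the detection and captioning objectives.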
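
The descriptiveness-weighted pooling step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pool_proposal_features`, the sum-to-one normalization of the scores, and the uniform fallback for all-zero scores are assumptions made here for illustration. It only shows the idea of weighting clip-level features by per-clip descriptiveness scores to obtain one proposal-level vector.

```python
from typing import List, Sequence


def pool_proposal_features(
    clip_features: Sequence[Sequence[float]],
    descriptiveness: Sequence[float],
) -> List[float]:
    """Weighted average of clip features inside one proposal.

    Hypothetical helper: weights are the per-clip descriptiveness scores,
    normalized to sum to one (the normalization is an assumption).
    """
    if not clip_features or len(clip_features) != len(descriptiveness):
        raise ValueError("need exactly one score per clip, and at least one clip")

    total = sum(descriptiveness)
    if total <= 0:
        # Assumption: fall back to uniform weights when no clip has a positive score.
        weights = [1.0 / len(descriptiveness)] * len(descriptiveness)
    else:
        weights = [score / total for score in descriptiveness]

    dim = len(clip_features[0])
    pooled = [0.0] * dim
    for weight, feature in zip(weights, clip_features):
        for i in range(dim):
            pooled[i] += weight * feature[i]
    return pooled


# Two clips with 2-D features; the first clip is three times as descriptive.
proposal_vector = pool_proposal_features([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
```

In this toy call the first clip dominates the pooled vector, which is the intended effect: clips judged more descriptive contribute more to the representation passed to the captioning module.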

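EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: In dense video captioning, how can the interaction between temporal event localization and sentence generation be modeled effectively?
RQ2: Can language-based feedback, provided through descriptiveness regression, improve the accuracy of temporal event proposals?
RQ3: Does joint optimization of detection and captioning outperform sequential or two-stage approaches?
RQ4: How do multi-scale anchor layers affect localization performance in dense video captioning?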

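KEY FINDINGS

The proposed framework reaches a new state-of-the-art METEOR score of 12.96% on the official ActivityNet Captions test set, surpassing all prior methods.
The descriptiveness regression component markedly improves temporal event proposal performance, reaching an AUC of 60.07% on the validation set and outperforming TAG, DCE, and TURN.
Replacing C3D features with P3D ResNet features raises the METEOR score from 12.85% to 12.96%, showing the benefit of richer clip-level representations.
Ablation experiments show that adding anchor layers with different temporal resolutions improves performance, with conv3 to conv11 giving the best balance between accuracy and model complexity.
Joint training with descriptiveness regression markedly improves the alignment between localized events and their descriptions, as reflected in gains in sentence relevance and localization recall.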
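
To make the multi-scale anchor layers (conv3 to conv11) concrete, the sketch below computes how the temporal length of the feature map could shrink across nine anchor layers. The stride of 2 per layer, the input length of 512, and the function name `anchor_layer_lengths` are assumptions chosen for illustration; the summary only states that temporal resolution decreases progressively.

```python
from typing import Dict


def anchor_layer_lengths(
    input_length: int,
    first_layer: int = 3,
    last_layer: int = 11,
    stride: int = 2,
) -> Dict[str, int]:
    """Temporal length of each anchor layer's feature map.

    Hypothetical helper: assumes every anchor layer downsamples the
    temporal axis by `stride` using ceiling division.
    """
    lengths: Dict[str, int] = {}
    length = input_length
    for index in range(first_layer, last_layer + 1):
        length = max(1, -(-length // stride))  # ceiling division, never below 1
        lengths[f"conv{index}"] = length
    return lengths


# Assumed input of 512 temporal positions entering the first anchor layer.
lengths = anchor_layer_lengths(512)
for name, length in lengths.items():
    print(name, length)
```

Under these assumptions the early layers keep many temporal positions and suit short events, while the last layers cover the video with only a few positions and suit long events.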


This review was generated by AI and reviewed by human editors.