QUICK REVIEW

[Paper Review] Weakly Supervised Dense Event Captioning in Videos

Xuguang Duan, Wenbing Huang|arXiv (Cornell University)|Dec 10, 2018

Multimodal Machine Learning Applications | Cited by 63

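One-Sentence Summary

This paper proposes weakly supervised dense event captioning (WS-DEC), which learns to localize and describe video events without temporal segment annotations by combining a dual cycle between sentence localization and caption generation with fixed-point iteration.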

ABSTRACT

Dense event captioning aims to detect and describe all events of interest contained in a video. Despite the advanced development in this area, existing methods tackle this task by making use of dense temporal annotations, which is dramatically source-consuming. This paper formulates a new problem: weakly supervised dense event captioning, which does not require temporal segment annotations for model training. Our solution is based on the one-to-one correspondence assumption, each caption describes one temporal segment, and each temporal segment has one caption, which holds in current benchmark datasets and most real-world cases. We decompose the problem into a pair of dual problems: event captioning and sentence localization and present a cycle system to train our model. Extensive experimental results are provided to demonstrate the ability of our model on both dense event captioning and sentence localization in videos.

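Research Motivation and Goals

Reduce annotation cost by removing temporal segment annotations from dense event captioning.
Exploit the one-to-one correspondence between captions and segments to enable weak supervision.
Develop dual-cycle learning over sentence localization and caption generation so the model can be trained end to end.
Demonstrate effectiveness on both dense captioning and sentence localization on ActivityNet Captions.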

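Proposed Method

Define two dual tasks: sentence localization l_θ1(V, C) and event captioning g_θ2(V, S), where V is the video, C a caption, and S a temporal segment.
At test time, use fixed-point iteration to converge to valid segments: S(t+1) = l_θ1(V, g_θ2(V, S(t))).
Train with a cycle constraint, C ≈ g_θ2(V, l_θ1(V, C)), together with a denoising-style loss that encourages convergence.
Apply Crossing Attention for cross-modal localization between video features and caption features.
Localize segments by multi-anchor classification and regression, then refine around the best anchor.
Introduce a soft clipping mechanism so that caption generation over a video segment is differentiable.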
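
The test-time fixed-point iteration S(t+1) = l_θ1(V, g_θ2(V, S(t))) can be illustrated with a minimal sketch. The code below is not the authors' implementation: `toy_step` is a hypothetical stand-in for the composed localizer and captioner, the (center, width) segment encoding is an assumption, and the stopping tolerance is chosen only for illustration.

```python
from typing import Callable, Tuple

# Assumed encoding: a segment is (center, width), both normalized to [0, 1].
Segment = Tuple[float, float]


def fixed_point_segment(
    step: Callable[[Segment], Segment],
    s0: Segment,
    max_iters: int = 50,
    tol: float = 1e-6,
) -> Tuple[Segment, int]:
    """Iterate s <- step(s) until the segment stops changing.

    `step` plays the role of S -> l(V, g(V, S)): caption the segment,
    then localize that caption back in the video.
    Returns the final segment and the number of iterations used.
    """
    s = s0
    for i in range(max_iters):
        s_next = step(s)
        # Largest coordinate change between successive iterates.
        delta = max(abs(a - b) for a, b in zip(s, s_next))
        if delta < tol:
            return s_next, i + 1
        s = s_next
    return s, max_iters


def toy_step(s: Segment) -> Segment:
    """Hypothetical contraction toward a made-up event at (0.4, 0.2).

    A real system would call trained networks here; this toy map only
    halves the distance to the target so that convergence is guaranteed.
    """
    target = (0.4, 0.2)
    return tuple(t + 0.5 * (x - t) for x, t in zip(s, target))


if __name__ == "__main__":
    segment, iters = fixed_point_segment(toy_step, s0=(0.9, 0.8))
    print(segment, iters)
```

Starting the same loop from several random initial segments would yield several candidate events; whether the iteration converges in practice depends on the trained models, which is what the denoising-style loss is meant to encourage.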

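Experimental Results

Research Questions

RQ1: Can dense event captioning be learned without temporal segment annotations?
RQ2: Is a bidirectional, one-to-one correspondence between captions and segments sufficient for WS-DEC training?
RQ3: Do fixed-point iteration and denoising improve training stability and performance under weak supervision?

Key Findings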

Model	Weakly supervised	METEOR	CIDEr	ROUGE-L	BLEU@1	BLEU@2	BLEU@3	BLEU@4
Krishna et al. (2017)	False	4.82	17.29	–	17.95	7.69	3.86	2.20
Yao et al.	False	7.71	16.08	13.27	17.50	9.62	5.54	3.38
Pretrained	True	4.58	10.45	9.27	8.70	3.39	1.50	0.69
Ours (no classification)	True	6.08	15.10	12.25	11.85	4.67	1.90	0.80
Ours (no regression)	True	6.11	17.66	12.40	11.98	5.45	2.69	1.44
Ours	True	6.30	18.77	12.55	12.41	5.50	2.62	1.27

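On ActivityNet Captions, the WS-DEC model is comparable to the fully supervised methods on METEOR (6.30, versus 4.82 for Krishna et al. and 7.71 for Yao et al.) and CIDEr (18.77, versus 17.29 and 16.08).
Among the weakly supervised variants, the full model achieves the best CIDEr score.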
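The final WS-DEC model (with all components) outperforms the pretrained baseline on every dense captioning metric and outperforms the ablated variants on METEOR, CIDEr, ROUGE-L, BLEU@1, and BLEU@2; the no-regression variant scores slightly higher on BLEU@3 and BLEU@4.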
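Localization results show that the model produces viable segment predictions under weak supervision, outperforming CTRL on some metrics (R@1 at IoU = 0.1 to 0.5, and mIoU) and approaching the supervised baselines.
Increasing the number of random initial segments Nr at test time yields modest gains with diminishing returns, which suggests robustness to the initial proposals.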
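
The localization metrics mentioned above (R@1 at an IoU threshold, and mIoU) rest on temporal intersection over union. The sketch below shows one common way to compute them. It assumes segments are given as (start, end) pairs in seconds and that a prediction counts as a hit when its IoU is at least the threshold; the function names are illustrative and the paper's exact evaluation script may differ.

```python
from typing import List, Tuple

# Assumed encoding: a segment is (start, end) in seconds, with start <= end.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Segment], gts: List[Segment], threshold: float) -> float:
    """Fraction of top-1 predictions whose IoU with the ground truth
    reaches the threshold (assumed convention: IoU >= threshold)."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(iou >= threshold for iou in ious) / len(ious)


def mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """Average IoU over all (prediction, ground truth) pairs."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Made-up segments, for illustration only.
    preds = [(0.0, 10.0), (0.0, 5.0)]
    gts = [(5.0, 15.0), (5.0, 10.0)]
    print(recall_at_1(preds, gts, threshold=0.3), mean_iou(preds, gts))
```

Sweeping `threshold` over 0.1, 0.3, and 0.5 gives the family of R@1 numbers typically reported for sentence localization.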

This review was generated by AI and reviewed by a human editor.