QUICK REVIEW

[Paper Digest] Tripping through time: Efficient Localization of Activities in Videos

Meera Hahn, Asim Kadav | arXiv (Cornell University) | Apr 22, 2019

Multimodal Machine Learning Applications | References: 20 | Cited by: 41

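ONE-SENTENCE SUMMARY

TripNet localizes moments in untrimmed videos using a gated-attention representation and a reinforcement-learning-based search, achieving strong accuracy while examining only 32-41% of the video.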

ABSTRACT

Localizing moments in untrimmed videos via language queries is a new and interesting task that requires the ability to accurately ground language into video. Previous works have approached this task by processing the entire video, often more than once, to localize relevant activities. In the real world applications of this approach, such as video surveillance, efficiency is a key system requirement. In this paper, we present TripNet, an end-to-end system that uses a gated attention architecture to model fine-grained textual and visual representations in order to align text and video content. Furthermore, TripNet uses reinforcement learning to efficiently localize relevant activity clips in long videos, by learning how to intelligently skip around the video. It extracts visual features for few frames to perform activity classification. In our evaluation over Charades-STA, ActivityNet Captions and the TACoS dataset, we find that TripNet achieves high accuracy and saves processing time by only looking at 32-41% of the entire video.

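MOTIVATION AND OBJECTIVES

Address the challenge of temporally localizing actions described in natural language within long, untrimmed videos.
Develop an end-to-end framework that grounds language in fine-grained video features.
Improve efficiency by learning a policy that intelligently skips non-essential frames.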

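PROPOSED METHOD

Proposes TripNet, which uses a gated-attention state representation to align the language query with video features.
Uses an actor-critic reinforcement learning framework (A3C) to learn a policy that moves a fixed-size candidate window across the video.
Defines a discrete action space that jumps the window by predefined frame strides, plus a TERMINATE action that outputs the current window.
Uses a reward that combines the improvement in IoU with a small penalty on the number of steps, to encourage efficiency.
Trains the model end to end, so that the visual and textual modalities are fused before policy learning.
Compares gated-attention TripNet against a concatenation baseline, TripNet-Concat, to demonstrate the benefit of gated attention.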
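
The window search described above can be made concrete with a small environment sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name `WindowSearch`, the jump sizes in `JUMPS`, the `step_penalty` value, the zero reward on termination, and the exact reward form (IoU gain minus a constant per-step penalty) are all hypothetical choices made for the example. The learned policy itself (A3C) is omitted; only the action space, the reward, and the bookkeeping of how much of the video has been examined are shown.

```python
from dataclasses import dataclass, field

# Hypothetical action set: signed jumps in frames, plus a stop action.
JUMPS = {"back_large": -64, "back_small": -16, "fwd_small": 16, "fwd_large": 64}
TERMINATE = "TERMINATE"


def temporal_iou(a, b):
    """Intersection over union of two half-open frame intervals (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class WindowSearch:
    num_frames: int
    target: tuple                 # ground-truth (start, end), in frames
    window_len: int               # fixed candidate window size
    step_penalty: float = 0.01    # assumed value, not taken from the paper
    start: int = 0
    seen: set = field(default_factory=set)

    def __post_init__(self):
        self._observe()

    def window(self):
        return (self.start, self.start + self.window_len)

    def _observe(self):
        # Record which frames the agent has looked at so far.
        self.seen.update(range(*self.window()))

    def fraction_seen(self):
        return len(self.seen) / self.num_frames

    def step(self, action):
        """Apply one action and return (reward, done)."""
        if action == TERMINATE:
            return 0.0, True      # the current window is the prediction
        before = temporal_iou(self.window(), self.target)
        last_start = self.num_frames - self.window_len
        # Jump the window, keeping it inside the video.
        self.start = min(max(self.start + JUMPS[action], 0), last_start)
        self._observe()
        after = temporal_iou(self.window(), self.target)
        # Assumed reward: IoU improvement minus a small per-step cost.
        return (after - before) - self.step_penalty, False
```

For example, with `WindowSearch(num_frames=1000, target=(128, 192), window_len=64)`, two `"fwd_large"` steps place the window exactly on the target, and `fraction_seen()` then reports the share of frames the agent has examined, which is the quantity behind the 32-41% figure in the summary.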

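EXPERIMENTAL RESULTS

Research Questions

RQ1: Can TripNet accurately localize moments described in natural language within long videos?
RQ2: Does the gated-attention fusion model improve localization accuracy over simple feature concatenation?
RQ3: How many video frames can be skipped while still achieving strong localization performance?
RQ4: How does TripNet compare with existing TALL methods in accuracy and efficiency on standard benchmarks?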

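Key Findings

TripNet achieves state-of-the-art or competitive accuracy on the Charades-STA, ActivityNet Captions, and TACoS datasets.
TripNet localizes moments while examining only 32-41% of the video on average, which substantially improves efficiency.
TripNet-GA (gated attention) outperforms TripNet-Concat, demonstrating the effectiveness of multimodal gated fusion.
On Charades-STA and TACoS, TripNet outperforms prior methods; on ActivityNet Captions, it is on par with the state of the art.
By avoiding exhaustive analysis of every frame while maintaining high localization accuracy, the method reduces overall video processing time.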
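
To make the TripNet-GA versus TripNet-Concat comparison above easier to picture, the sketch below contrasts the two fusion styles on toy data. It assumes one common formulation of gated attention, in which the text embedding is projected to one value per visual channel, passed through a sigmoid, and multiplied into the visual features. Whether TripNet's layer matches this exactly is an assumption, and the function names, shapes, and fixed `weights` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, vec):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def gated_attention(visual, text, weights):
    """Scale each visual channel by a sigmoid gate computed from the text.

    visual:  list of per-frame feature vectors, each with C channels
    text:    query embedding with D dimensions
    weights: C x D projection (learned in a real model, fixed here)
    """
    gate = [sigmoid(z) for z in linear(weights, text)]
    return [[g * x for g, x in zip(gate, frame)] for frame in visual]


def concat_fusion(visual, text):
    """Baseline: append the text embedding to every frame's features."""
    return [frame + text for frame in visual]
```

The design difference is that gating lets the query suppress or emphasize individual visual channels and keeps the feature dimension at C, whereas concatenation leaves the visual features untouched, grows the dimension to C + D, and leaves any cross-modal interaction to later layers.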


This digest was generated by AI and reviewed by a human editor.