QUICK REVIEW

[Paper Review] Temporal Segment Networks for Action Recognition in Videos

Limin Wang, Yuanjun Xiong|arXiv (Cornell University)|May 8, 2017

Human Pose and Action Recognition | References: 59 | Cited by: 53

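One-Sentence Summary

Introduces Temporal Segment Networks (TSN), which model long-range temporal structure in videos through sparse segment-based sampling and segmental consensus, achieve state-of-the-art results on several action recognition benchmarks, and enable real-time motion modeling from RGB differences.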

ABSTRACT

Deep convolutional networks have achieved great success for image recognition. However, for action recognition in videos, their advantage over traditional methods is not so evident. We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structures with a new segment-based sampling and aggregation module. This unique design enables our TSN to efficiently learn action models by using the whole action videos. The learned models could be easily adapted for action recognition in both trimmed and untrimmed videos with simple average pooling and multi-scale temporal window integration, respectively. We also study a series of good practices for the instantiation of TSN framework given limited training samples. Our approach obtains the state-of-the-art performance on four challenging action recognition benchmarks: HMDB51 (71.0%), UCF101 (94.9%), THUMOS14 (80.1%), and ActivityNet v1.2 (89.6%). Using the proposed RGB difference for motion models, our method can still achieve competitive accuracy on UCF101 (91.0%) while running at 340 FPS. Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.

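Research Motivation and Objectives

Model long-range temporal structure in videos for action recognition.
Develop a video-level framework that uses sparse sampling to process entire videos.
Make TSN applicable to both trimmed and untrimmed videos through hierarchical aggregation.
Summarize good practices for training deep action models when training data is limited.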

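Proposed Method

Divide the video into K segments and sample one snippet from each segment.
Process every snippet with a shared convolutional network to obtain snippet-level scores.
Aggregate the snippet scores with a flexible consensus function (max pooling, average pooling, Top-K pooling, weighted average, or attention weighting).
Apply the model to untrimmed videos using multi-scale temporal window integration (M-TWI).
Explore cross-modality initialization and partial batch normalization to improve training with limited data.
Experiment with input modalities including RGB, optical flow, RGB difference, and warped optical flow.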
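
The first three steps above (segment-based sampling, shared scoring, consensus) can be illustrated with a small sketch. This is not the authors' implementation: the function names `sample_snippet_indices` and `consensus`, the frame-index representation of a snippet, and the default `top_k=3` are assumptions made for illustration. Only the parameter-free consensus functions (average, max, Top-K) are shown; the weighted and attention variants need learned parameters and are left out.

```python
import random
from statistics import mean


def sample_snippet_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of `num_segments` equal-length segments.

    Hypothetical helper: a snippet is represented only by its frame index.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        start = (k * num_frames) // num_segments        # inclusive
        end = ((k + 1) * num_frames) // num_segments    # exclusive
        indices.append(rng.randrange(start, end))
    return indices


def consensus(snippet_scores, kind="average", top_k=3):
    """Aggregate per-snippet class scores into video-level class scores.

    `snippet_scores` is a list with one list of class scores per snippet.
    """
    per_class = list(zip(*snippet_scores))  # regroup scores by class
    if kind == "average":
        return [mean(c) for c in per_class]
    if kind == "max":
        return [max(c) for c in per_class]
    if kind == "top_k":
        k = min(top_k, len(snippet_scores))
        # Average the k highest snippet scores of each class.
        return [mean(sorted(c, reverse=True)[:k]) for c in per_class]
    raise ValueError(f"unknown consensus function: {kind}")


if __name__ == "__main__":
    frame_ids = sample_snippet_indices(num_frames=300, num_segments=3)
    print("sampled frames:", frame_ids)
    # Stand-in scores for 3 snippets and 2 classes (a real model would
    # produce these by running the shared network on each snippet).
    scores = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
    print("average:", consensus(scores, "average"))
    print("top-2:  ", consensus(scores, "top_k", top_k=2))
```

Because the number of sampled snippets is fixed at K regardless of video length, the cost of one training step does not grow with the duration of the video, which is what makes video-level training affordable.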

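Experimental Results

Research Questions

RQ1: How can long-range temporal structure in videos be captured effectively for action recognition while using a lightweight sampling strategy?
RQ2: Can a segment-based aggregation framework achieve accurate recognition in both trimmed and untrimmed videos?
RQ3: With limited data, which input modalities and training practices improve performance the most?
RQ4: How do different aggregation strategies affect video-level prediction and training dynamics?
RQ5: What effect do cross-modality initialization and partial BN have on model performance?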

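Key Findings

Achieves state-of-the-art accuracy on HMDB51 (71.0%), UCF101 (94.9%), THUMOS14 (80.1%), and ActivityNet v1.2 (89.6%).
With RGB difference as the motion input, reaches 91.0% on UCF101 while running at 340 FPS.
The framework handles both trimmed and untrimmed videos, achieving strong results on untrimmed videos through multi-scale temporal window integration.
Introduces five aggregation functions and shows that Top-K pooling and attention weighting improve robustness to background content.
Confirms that cross-modality initialization and partial BN help train deep models for action recognition with limited data.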
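
To make the RGB-difference input mentioned above concrete, here is a minimal sketch of the idea: the motion input is built from pixel-wise differences between consecutive frames, so no optical flow has to be computed. The function name `stacked_rgb_difference` and the flat-list frame representation are assumptions for illustration; this summary does not specify how many frames are stacked or how the values are normalized.

```python
def stacked_rgb_difference(frames):
    """Return the pixel-wise differences between consecutive frames.

    `frames` is a list of frames, each a flat list of pixel values
    (hypothetical representation; a real pipeline would use image arrays).
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        if len(prev) != len(cur):
            raise ValueError("all frames must have the same size")
        diffs.append([c - p for p, c in zip(prev, cur)])
    return diffs


if __name__ == "__main__":
    frames = [[10, 20, 30], [12, 18, 30], [15, 18, 25]]
    for d in stacked_rgb_difference(frames):
        print(d)
```

A subtraction per pixel is far cheaper than estimating optical flow, which is consistent with the reported trade-off of a somewhat lower accuracy (91.0% on UCF101) at 340 FPS.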


This review was generated by AI and checked by human editors.