QUICK REVIEW

[Paper Review] Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos

Rui Hou, Chen Chen | arXiv (Cornell University) | Mar 30, 2017

Human Pose and Action Recognition | References: 24 | Cited by: 58

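One-Sentence Summary

T-CNN is an end-to-end 3D-CNN framework for spatio-temporal action detection and localization in videos. It generates and links 3D tube proposals using a Tube Proposal Network and Tube-of-Interest pooling.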

ABSTRACT

Deep learning has been demonstrated to achieve excellent results for image classification and object detection. However, the impact of deep learning on video analysis (e.g. action detection and recognition) has been limited due to complexity of video data and lack of annotations. Previous convolutional neural networks (CNN) based video action detection approaches usually consist of two major steps: frame-level action proposal detection and association of proposals across frames. Also, these methods employ two-stream CNN framework to handle spatial and temporal feature separately. In this paper, we propose an end-to-end deep network called Tube Convolutional Neural Network (T-CNN) for action detection in videos. The proposed architecture is a unified network that is able to recognize and localize action based on 3D convolution features. A video is first divided into equal length clips and for each clip a set of tube proposals are generated next based on 3D Convolutional Network (ConvNet) features. Finally, the tube proposals of different clips are linked together employing network flow and spatio-temporal action detection is performed using these linked video proposals. Extensive experiments on several video datasets demonstrate the superior performance of T-CNN for classifying and localizing actions in both trimmed and untrimmed videos compared to state-of-the-arts.

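Motivation and Objectives

Address the need for end-to-end spatio-temporal action detection in videos.
Propose a unified 3D-CNN framework that localizes and recognizes actions directly from video clips.
Introduce a Tube Proposal Network (TPN) that generates tube proposals from 3D features.
Develop Tube-of-Interest (ToI) pooling, which produces fixed-length descriptors for tube proposals of varying size.
Demonstrate state-of-the-art performance on trimmed and untrimmed video datasets.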

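Proposed Method

Process video clips with a 3D ConvNet to extract spatio-temporal feature cubes.
Use a Tube Proposal Network (TPN) to generate tube proposals for each clip, each with an actionness score, using anchor boxes learned by k-means.
Link tube proposals across adjacent clips using actionness and overlap scores together with network flow.
Apply Tube-of-Interest (ToI) pooling to obtain fixed-length features from the linked tube proposals for action classification.
Train end to end, alternating updates between the TPN and the recognition network; 1x1 convolutions match dimensions, and the final fully connected layers perform bounding-box regression and action classification.
Use temporal skip pooling to preserve frame-order information by mapping conv5 proposals onto the conv2 feature tubes, which span the eight frames of each clip.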
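
The ToI pooling step listed above can be illustrated with a small sketch. It assumes ToI pooling acts as a 3D analogue of RoI max pooling: the tube is split into a fixed grid of temporal and spatial bins, and the maximum is taken inside each bin. This summary does not specify the bin layout, so the floor/ceil bin boundaries here are an assumption, and the names `bin_edges`, `toi_max_pool`, and `make_tube` are hypothetical. The sketch works on a single channel stored as nested lists, not on real ConvNet features.

```python
def bin_edges(size, bins):
    """Split range(size) into `bins` contiguous, non-empty (start, end) slices.

    Assumption: floor for the start and ceil for the end, as in common RoI
    pooling implementations, so neighbouring bins may overlap by one index.
    """
    edges = []
    for b in range(bins):
        start = (b * size) // bins
        end = -(-((b + 1) * size) // bins)  # ceil division
        edges.append((start, max(start + 1, end)))
    return edges


def toi_max_pool(tube, out_d, out_h, out_w):
    """Max-pool one channel of a tube, given as tube[t][y][x], to a fixed
    out_d x out_h x out_w grid regardless of the tube's own size."""
    d, h, w = len(tube), len(tube[0]), len(tube[0][0])
    pooled = []
    for t0, t1 in bin_edges(d, out_d):
        plane = []
        for y0, y1 in bin_edges(h, out_h):
            row = []
            for x0, x1 in bin_edges(w, out_w):
                row.append(max(
                    tube[t][y][x]
                    for t in range(t0, t1)
                    for y in range(y0, y1)
                    for x in range(x0, x1)
                ))
            plane.append(row)
        pooled.append(plane)
    return pooled


def make_tube(d, h, w):
    """Toy tube whose value encodes its own position (t*100 + y*10 + x)."""
    return [[[float(t * 100 + y * 10 + x) for x in range(w)]
             for y in range(h)] for t in range(d)]


# Two tubes of different sizes are pooled to the same 1 x 2 x 2 shape.
small = toi_max_pool(make_tube(2, 3, 4), 1, 2, 2)
large = toi_max_pool(make_tube(8, 7, 5), 1, 2, 2)
```

Because both outputs have the same shape, they could be flattened into descriptors of equal length and passed to the same fully connected layers, which is the role the method description assigns to ToI pooling.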

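Experimental Results

Research Questions

RQ1: Can an end-to-end 3D CNN framework learn to localize and recognize actions directly from video input, without relying on two-stream or frame-based proposals?
RQ2: Does a Tube Proposal Network with data-driven anchor boxes improve spatio-temporal action localization over frame-based proposals?
RQ3: Can ToI pooling effectively produce fixed-length descriptors for variable-length tubes and thereby enable robust action classification?
RQ4: Does temporal skip pooling preserve temporal order information and improve localization accuracy?
RQ5: How does T-CNN perform on trimmed and untrimmed videos across multiple datasets?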

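Key Findings

T-CNN achieves state-of-the-art performance on the trimmed datasets UCF-Sports, J-HMDB, and UCF-101, and on the untrimmed THUMOS’14 dataset.
Tube proposals based on a 3D ConvNet, combined with ToI pooling, improve action localization and recognition.
Temporal skip pooling preserves temporal order information and improves localization accuracy.
An end-to-end approach built on 3D volumes, with anchors learned by k-means, outperforms methods that rely on frame-level proposals or two-stream architectures.
The method shows strong action recognition accuracy: 95.7% on UCF-Sports, 67.2% on J-HMDB, and 94.4% on UCF-101 (24 actions).
On THUMOS’14, negative sample mining further improves performance.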


This review was generated by AI and reviewed by a human editor.