
[Paper Review] Long-term Temporal Convolutions for Action Recognition

Gül Varol, Ivan Laptev | arXiv (Cornell University) | Apr 15, 2016
Human Pose and Action Recognition | References: 29 | Cited by: 137
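One-Sentence Summary

This paper introduces long-term temporal convolutions (LTC) into 3D CNNs to model the extended temporal structure of actions, achieving state-of-the-art results on UCF101 and HMDB51, particularly when the optical flow and RGB streams are combined with IDT features.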

ABSTRACT

Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition UCF101 (92.7%) and HMDB51 (67.2%).

Motivation and Objectives

  • Motivate learning video representations that capture long-range spatio-temporal structure of actions lasting several seconds.
  • Investigate LTC (long-term temporal convolutions) to extend temporal extent while balancing spatial resolution and model complexity.
  • Assess the impact of different low-level representations, particularly high-quality optical flow, on action recognition.
  • Evaluate the benefit of data augmentation, pre-training, and multi-modal fusion (RGB, flow, and IDT) for LTC-based models.
  • Provide insights into how LTC learns temporal patterns and how it affects performance across datasets.

Proposed Method

  • Propose a 3D CNN architecture with 5 spatio-temporal conv layers using 3x3x3 filters and progressive temporal extent.
  • Compare input configurations with 16 frames vs 60 frames and explore temporal extents up to 100 frames.
  • Evaluate RGB and flow inputs (MPEG flow, Farneback, Brox), and study the effect of optical flow quality on recognition.
  • Use data augmentation (random clipping, multiscale cropping) and dropout, with training from scratch or fine-tuning.
  • Train on UCF101 and HMDB51, report clip-level and video-level accuracies, and test with multi-crop and multi-clip averaging.
  • Investigate pre-training on large datasets (Sports-1M) for RGB networks and fine-tune on HMDB51, and explore late fusion of RGB and flow streams, including combining with IDT features.
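To make the trade-off between temporal extent and spatial resolution concrete, the following is a minimal sketch that traces feature-map shapes through five 3x3x3 convolution layers. The padding of 1, the per-layer pooling sizes in `ASSUMED_POOLS`, and the 112x112 and 58x58 spatial resolutions are illustrative assumptions, not values stated in this summary; only the five-layer depth, the 3x3x3 filters, and the 16-frame and 60-frame inputs come from the text above.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]  # (frames, height, width)


def conv3d_out(shape: Shape, kernel: int = 3, pad: int = 1, stride: int = 1) -> Shape:
    """Output size of a 3D convolution applied independently per dimension."""
    return tuple((d + 2 * pad - kernel) // stride + 1 for d in shape)


def pool3d_out(shape: Shape, pool: Shape) -> Shape:
    """Output size of non-overlapping max pooling (floor division)."""
    return tuple(d // p for d, p in zip(shape, pool))


def trace_shapes(input_shape: Shape, pools: List[Shape]) -> List[Shape]:
    """Return the shape after each conv + pool stage, starting with the input."""
    shapes = [input_shape]
    current = input_shape
    for pool in pools:
        current = pool3d_out(conv3d_out(current), pool)
        shapes.append(current)
    return shapes


# Assumption: the first stage pools only spatially, the remaining four pool
# in time and space. This is a hypothetical configuration for illustration.
ASSUMED_POOLS: List[Shape] = [(1, 2, 2)] + [(2, 2, 2)] * 4

# Assumption: spatial sizes are hypothetical; frame counts come from the text.
short_clip = trace_shapes((16, 112, 112), ASSUMED_POOLS)
long_clip = trace_shapes((60, 58, 58), ASSUMED_POOLS)

for name, shapes in (("16 frames", short_clip), ("60 frames", long_clip)):
    print(name, " -> ".join("x".join(map(str, s)) for s in shapes))
```

Under these assumptions, the longer clip keeps more temporal positions in the final feature map while the lower input resolution shrinks its spatial size, which is the balance between temporal extent, spatial resolution, and model complexity that the method explores.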

Experimental Results

Research Questions

  • RQ1: How does increasing the temporal extent in 3D CNNs (LTC) affect action recognition performance?
  • RQ2: What is the impact of input modality (RGB vs optical flow) and flow quality on LTC-based models?
  • RQ3: What data augmentation strategies most improve LTC performance on limited data?
  • RQ4: Does pre-training RGB networks on large datasets boost LTC performance when extending temporal extent?
  • RQ5: Do combinations of multi-resolution LTCs and multi-modal inputs yield complementary gains over single-stream models?

Key Findings

  • Long-term temporal convolutions significantly improve clip- and video-level accuracies over networks trained on short clips (e.g., 60-frame vs. 16-frame inputs).
  • Optical flow inputs, especially high-quality Brox flow, outperform RGB inputs for LTC-based action recognition.
  • Data augmentation (random clipping, multiscale cropping) and higher dropout substantially boost performance.
  • Pre-training RGB LTC networks on Sports-1M and then extending temporal extent yields notable gains on UCF101.
  • Combining flow and RGB LTC streams yields strong gains, and LTC Flow+RGB+IDT achieves state-of-the-art results on UCF101 (92.7%) and HMDB51 (67.2%).
  • Analysis of first-layer 3D filters shows LTC learns expressive spatio-temporal motion patterns, with higher-layer filters displaying increased class purity.
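The video-level evaluation and stream combination described above can be sketched as two averaging steps: clip scores are averaged into one video score per stream, then the per-stream video scores are combined by late fusion. This is a minimal sketch; the equal fusion weights, the two-class toy scores, and the function names `average_scores` and `late_fusion` are assumptions for illustration, since this summary does not state the exact fusion rule. An IDT-based classifier could be added as one more entry in the stream dictionary, provided its scores are on a comparable scale.

```python
from typing import Dict, List, Optional, Sequence


def average_scores(score_lists: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class scores over clips (or crops) of one video."""
    if not score_lists:
        raise ValueError("need at least one score vector")
    n = len(score_lists)
    num_classes = len(score_lists[0])
    return [sum(s[c] for s in score_lists) / n for c in range(num_classes)]


def late_fusion(
    stream_scores: Dict[str, Sequence[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of video-level scores across streams.

    Assumption: equal weights when none are given.
    """
    if weights is None:
        weights = {name: 1.0 for name in stream_scores}
    total = sum(weights[name] for name in stream_scores)
    num_classes = len(next(iter(stream_scores.values())))
    return [
        sum(weights[name] * scores[c] for name, scores in stream_scores.items()) / total
        for c in range(num_classes)
    ]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest score, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy clip-level scores for one video with two classes (hypothetical numbers).
flow_clips = [[0.6, 0.4], [0.8, 0.2]]
rgb_clips = [[0.3, 0.7], [0.5, 0.5]]

flow_video = average_scores(flow_clips)
rgb_video = average_scores(rgb_clips)
fused = late_fusion({"flow": flow_video, "rgb": rgb_video})
predicted = argmax(fused)
print(flow_video, rgb_video, fused, predicted)
```

In this toy case the RGB stream alone favors class 1, while the fused score favors class 0 because the flow stream is more confident, which illustrates how complementary streams can change the video-level decision.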


This review was generated by AI and reviewed by human editors.