QUICK REVIEW

[Paper Review] Hidden Two-Stream Convolutional Networks for Action Recognition

Yi Zhu, Zhenzhong Lan|arXiv (Cornell University)|Apr 2, 2017
Human Pose and Action Recognition|30 references|99 citations
TL;DR

The paper introduces Hidden Two-Stream Networks that learn motion representations directly from raw frames via MotionNet in an end-to-end framework, achieving real-time action recognition without pre-computing optical flow. It demonstrates competitive accuracy across four datasets and is substantially faster than two-stage baselines.

ABSTRACT

Analyzing videos of human actions involves understanding the temporal relationships among video frames. State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs. Such a two-stage approach is computationally expensive, storage demanding, and not end-to-end trainable. In this paper, we present a novel CNN architecture that implicitly captures motion information between adjacent frames. We name our approach hidden two-stream CNNs because it only takes raw video frames as input and directly predicts action classes without explicitly computing optical flow. Our end-to-end approach is 10x faster than its two-stage baseline. Experimental results on four challenging action recognition datasets: UCF101, HMDB51, THUMOS14 and ActivityNet v1.2 show that our approach significantly outperforms the previous best real-time approaches.

Motivation & Objective

  • Motivate end-to-end learning of motion representations for action recognition to avoid costly optical flow pre-computation.
  • Introduce MotionNet, which learns optical-flow-like motion from frame pairs without supervision.
  • Stack MotionNet with a temporal CNN and train end-to-end for action classification.
  • Demonstrate improved efficiency and competitive accuracy across standard benchmarks.

Proposed method

  • Propose MotionNet, a fully convolutional network that learns frame-to-frame motion by reconstructing one frame from another using backward warping.
  • Train MotionNet with unsupervised multi-scale losses: pixel reconstruction, smoothness, and SSIM-based perceptual loss.
  • Clip, normalize, and quantize predicted flow to feed a temporal stream CNN, enabling end-to-end stacking.
  • Compare stacking and branching as ways to combine MotionNet with the temporal CNN; adopt stacking, so that the temporal CNN projects MotionNet's motion features to action labels.
  • Fuse predictions from the temporal motion stream and a spatial stream in a hidden two-stream architecture.
  • Evaluate on four datasets (UCF101, HMDB51, THUMOS14, ActivityNet) with standard splits and data augmentations.
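
The reconstruction and quantization steps listed above can be illustrated with a small sketch. It is not the authors' implementation: it uses plain Python lists on single-channel images, and the function names, the Charbonnier-style penalty with `eps` and `alpha`, and the clipping bound of 20 pixels are all assumptions chosen for illustration. It shows only the forward computation: backward-warp frame 2 with a flow field to reconstruct frame 1, score the reconstruction, and map a flow value to an 8-bit range.

```python
import math

def bilinear_sample(img, x, y):
    """Sample a 2-D grayscale image (list of rows) at real-valued (x, y).
    Coordinates outside the image are clamped to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bot = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bot

def backward_warp(frame2, flow_u, flow_v):
    """Reconstruct frame 1 by sampling frame 2 at (x + u, y + v) for each pixel."""
    h, w = len(frame2), len(frame2[0])
    return [[bilinear_sample(frame2, x + flow_u[y][x], y + flow_v[y][x])
             for x in range(w)]
            for y in range(h)]

def charbonnier_loss(frame1, recon, eps=1e-3, alpha=0.4):
    """Mean of (d^2 + eps^2)^alpha over all pixels (eps and alpha are assumed values)."""
    h, w = len(frame1), len(frame1[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            d = frame1[y][x] - recon[y][x]
            total += (d * d + eps * eps) ** alpha
    return total / (h * w)

def quantize_flow(value, bound=20.0):
    """Clip a flow value to [-bound, bound], rescale to [0, 255], round to an integer.
    The bound of 20 pixels is an assumption, not taken from the review."""
    clipped = min(max(value, -bound), bound)
    scaled = (clipped + bound) * 255.0 / (2.0 * bound)
    return int(scaled + 0.5)
```

A flow field that matches the true motion gives a lower reconstruction loss than a zero flow, which is the signal an unsupervised motion estimator can train on. In a real network the warp would be a differentiable sampling layer operating on tensors, and the smoothness and SSIM terms would be added to this pixel loss at several scales.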

Experimental results

Research questions

  • RQ1: Can motion information be learned end-to-end from raw frames without explicit optical flow pre-computation?
  • RQ2: Does unsupervised MotionNet learning improve action recognition when stacked with a temporal CNN?
  • RQ3: Is end-to-end training with multi-task objectives (including unsupervised losses) beneficial for action recognition?
  • RQ4: How does hidden two-stream fusion compare to traditional two-stream methods in accuracy and speed?

Key findings

  • MotionNet, trained unsupervised, provides competitive optical-flow-like representations and, when stacked with a temporal CNN, yields strong action recognition performance.
  • End-to-end hidden two-stream networks are around 10x faster than two-stage baselines due to on-the-fly motion estimation and no flow storage.
  • The stacked temporal stream with MotionNet, when fused with a spatial stream, achieves improved accuracy over single-stream baselines.
  • End-to-end fine-tuning with unsupervised and action losses yields the best recognition results among tested configurations.
  • MotionNet shows robustness and generalization, performing competitively on optical-flow benchmarks while delivering strong action recognition results.
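
The fusion of the spatial and temporal streams mentioned in the findings can be sketched as a weighted average of per-class softmax scores. This is a minimal illustration under assumptions: the review does not state the fusion rule or the weights, so `fuse_streams`, `w_spatial`, and `w_temporal` are hypothetical names and the equal default weights are placeholders.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_streams(spatial_logits, temporal_logits, w_spatial=1.0, w_temporal=1.0):
    """Late fusion: weighted average of the two streams' class probabilities.
    Returns (predicted_class_index, fused_probabilities)."""
    p_s = softmax(spatial_logits)
    p_t = softmax(temporal_logits)
    norm = w_spatial + w_temporal
    fused = [(w_spatial * a + w_temporal * b) / norm for a, b in zip(p_s, p_t)]
    best = max(range(len(fused)), key=lambda i: fused[i])
    return best, fused

if __name__ == "__main__":
    # The spatial stream slightly prefers class 0; the temporal stream strongly prefers class 1.
    label, scores = fuse_streams([2.0, 1.0, 0.0], [0.0, 3.0, 0.0])
    print(label, [round(s, 3) for s in scores])
```

Because fusion happens on the output scores, either stream can be swapped or reweighted without retraining the other.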

This review was created by AI and reviewed by human editors.