QUICK REVIEW

[Paper Review] Convolutional Two-Stream Network Fusion for Video Action Recognition

Christoph Feichtenhofer, Axel Pinz|arXiv (Cornell University)|2016. 04. 22.

Human Pose and Action Recognition|References: 31|Citations: 371

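One-Line Summary

This paper presents a spatiotemporal fusion architecture that combines a spatial and a temporal ConvNet stream using several fusion strategies, reports state-of-the-art results on UCF101 and HMDB51, and shows that spatial fusion at a late convolutional layer combined with 3D temporal pooling performs strongly while using fewer parameters than simple softmax fusion.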

ABSTRACT

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.

Research Motivation and Objectives

Improve action recognition by effectively fusing appearance (spatial) and motion (temporal) cues from video.
Investigate where, how, and how often to fuse two ConvNet streams to maximize spatiotemporal feature learning.
Develop a practical spatiotemporal fusion architecture that preserves spatial correspondence while leveraging temporal context.
Compare fusion strategies and depths to understand their impact on accuracy and model size.

Proposed Method

Evaluate multiple fusion functions (sum, max, concatenation, conv, bilinear) to combine two streams at chosen layers.
Experiment with fusion locations (after various conv layers, FC layers, or multi-layer fusion) under the constraint of matching spatial dimensions.
Implement temporal fusion via 2D/3D pooling and 3D convolutions to capture short-term and long-term temporal structure.
Propose a spatiotemporal fusion architecture that fuses at the last convolutional layer with 3D conv fusion and 3D pooling, while preserving the temporal stream.
Train two-stream networks (spatial: RGB, temporal: optical flow) pretrained on ImageNet, then finetune on UCF101 and HMDB51; evaluate with dense temporal sampling at test time.
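
The fusion functions and the 3D pooling step listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the (H, W, D) feature-map layout, the function names, the random stand-in filters (learned during training in a real network), and the non-overlapping pooling windows are all assumptions made for the example.

```python
import numpy as np

# Assumed layout: each feature map has shape (H, W, D) = height, width, channels.
# xa stands for the spatial-stream map, xb for the temporal-stream map.

def sum_fusion(xa, xb):
    # Element-wise sum of corresponding channels; output keeps D channels.
    return xa + xb

def max_fusion(xa, xb):
    # Element-wise maximum of corresponding channels; output keeps D channels.
    return np.maximum(xa, xb)

def concat_fusion(xa, xb):
    # Stack the two maps along the channel axis; output has 2D channels.
    return np.concatenate([xa, xb], axis=-1)

def conv_fusion(xa, xb, filters, bias):
    # Concatenate, then apply a 1x1 convolution.
    # filters: (2D, D_out), bias: (D_out,). A 1x1 convolution is a
    # per-pixel matrix product over the channel axis.
    return concat_fusion(xa, xb) @ filters + bias

def bilinear_fusion(xa, xb):
    # Outer product of the two channel vectors at every pixel,
    # summed over all pixel locations; output is a vector of length D*D.
    return np.einsum("hwi,hwj->ij", xa, xb).reshape(-1)

def max_pool_3d(stack, pt, ph, pw):
    # stack: (T, H, W, D) fused maps stacked over time.
    # Non-overlapping max pooling over time and space (a simplification).
    # Setting pt=1 pools each time step separately, i.e. plain 2D pooling.
    t, h, w, d = stack.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0
    blocks = stack.reshape(t // pt, pt, h // ph, ph, w // pw, pw, d)
    return blocks.max(axis=(1, 3, 5))

# Usage with random stand-in data (hypothetical sizes).
rng = np.random.default_rng(0)
xa = rng.standard_normal((4, 4, 8))
xb = rng.standard_normal((4, 4, 8))
filters = rng.standard_normal((16, 8))       # would be learned in practice
bias = np.zeros(8)

fused = conv_fusion(xa, xb, filters, bias)   # shape (4, 4, 8)
stack = np.stack([fused] * 4)                # shape (4, 4, 4, 8), 4 time steps
pooled = max_pool_3d(stack, 2, 2, 2)         # shape (2, 2, 2, 8)
```

Sum, max, and conv fusion keep the spatial grid intact, which is what allows the fused maps to be stacked over time and pooled in 3D afterwards. Bilinear fusion collapses the spatial grid into a single vector, so it cannot feed a later spatial or spatiotemporal pooling stage in the same way.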

Experimental Results

Research Questions

RQ1: What fusion strategy between spatial and temporal streams yields the best action recognition accuracy?
RQ2: Where in the network should fusion occur to maximize performance while minimizing parameters?
RQ3: How should temporal information be fused to capture short-term and long-term dynamics effectively?
RQ4: Does a deeper network (e.g., VGG-16) improve the spatial stream more than it improves the temporal stream?
RQ5: How does spatiotemporal fusion compare to single-stream or late fusion baselines on standard benchmarks?

Key Results

Conv fusion at the last convolutional layer (ReLU5) outperforms other spatial fusion layers and is competitive with or better than late fusion at the softmax layer, with substantially fewer parameters.
Concatenation and max fusion generally underperform compared to sum or conv fusion for spatial fusion, and conv fusion provides the best accuracy in many settings.
Fusing the two streams at ReLU5 with 3D conv fusion followed by 3D pooling improves performance over 2D pooling, while keeping the architectural benefit of explicit spatiotemporal correspondence.
With a deeper model (VGG-16) in both streams, accuracy improves notably for the spatial network, while gains for the temporal network are smaller, indicating that the spatial stream benefits more from depth.
Temporal fusion using 3D convs and 3D pooling yields higher accuracy than plain 2D fusion or simple averaging of predictions; a 3D fusion filter further boosts performance on benchmarks.
The proposed 3D spatiotemporal fusion architecture achieves state-of-the-art results on UCF101 and HMDB51 compared with prior two-stream methods.
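
The parameter saving mentioned in the first result can be illustrated with simple arithmetic. The sketch below is a back-of-the-envelope count, not a figure taken from the paper: the layer sizes (a 6x6x512 map after the last convolutional block, fully connected layers of 4096, 2048, and 101 units) and the helper names are assumptions chosen for illustration. It also assumes that one tower's fully connected layers are dropped after fusion, so it does not describe variants that keep the temporal stream.

```python
def layer_params(n_in, n_out):
    # Weights plus biases of a fully connected layer (or of a 1x1 convolution).
    return n_in * n_out + n_out

def tower_fc_params(n_in, fc_sizes):
    # Total parameters of a chain of fully connected layers.
    total = 0
    for n_out in fc_sizes:
        total += layer_params(n_in, n_out)
        n_in = n_out
    return total

# Hypothetical sizes, for illustration only.
CONV_CHANNELS = 512
CONV_OUT = 6 * 6 * CONV_CHANNELS      # flattened last convolutional map
FC_SIZES = [4096, 2048, 101]          # last layer = number of classes

fc_one_tower = tower_fc_params(CONV_OUT, FC_SIZES)

# Late (softmax) fusion: both towers keep their own fully connected layers.
late_fusion_fc = 2 * fc_one_tower

# Conv fusion at the last convolutional layer: a 1x1 filter bank maps the
# 2 * 512 concatenated channels back to 512, and one set of fully connected
# layers remains.
fusion_filter = layer_params(2 * CONV_CHANNELS, CONV_CHANNELS)
conv_fusion_fc = fc_one_tower + fusion_filter

print(f"late fusion, after last conv layer: {late_fusion_fc:,}")
print(f"conv fusion, after last conv layer: {conv_fusion_fc:,}")
print(f"saved:                              {late_fusion_fc - conv_fusion_fc:,}")
```

Under these assumed sizes the fully connected layers dominate the count, so removing one tower's copy roughly halves the parameters after the fusion point, while the added 1x1 fusion filter costs comparatively little. The convolutional layers of both towers are present in either case and are left out of the comparison.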

This review was generated by AI and reviewed by a human editor.