[Paper Review] Beyond Short Snippets: Deep Networks for Video Classification
This paper proposes two families of deep neural network architectures, temporal feature pooling and LSTM-based models, that use full-length video clips (up to 120 frames, about 2 minutes) for video classification. By combining CNN frame features with optical flow and modeling long-range temporal dependencies, the method reports state-of-the-art results on UCF-101 (88.6%) and Sports-1M (73.1%), narrowly ahead of the previous best on UCF-101 (88.0%) and well ahead of it on Sports-1M (60.9%).
Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.6% vs. 88.0%) and without additional optical flow information (82.6% vs. 72.8%).
Motivation & Objective
- To improve video classification by modeling long-range temporal dependencies across full-length videos, rather than relying on short video snippets.
- To investigate whether global video-level representations, learned from extended frame sequences, yield better performance than frame-level aggregation.
- To evaluate the effectiveness of optical flow as explicit motion encoding when combined with deep learning architectures.
- To compare the performance of feature-pooling and recurrent architectures (LSTM) in capturing temporal evolution in videos.
- To determine whether low frame rates (1 fps) can be used effectively when paired with optical flow to reduce computational cost while preserving accuracy.
Proposed method
- Processes video frames at 1 fps, which reduces computational cost while still covering a long temporal context.
- Employs a 2D CNN to extract spatial features from each frame, followed by temporal feature pooling (e.g., max-pooling) to aggregate frame-level features into a global video descriptor.
- Uses Long Short-Term Memory (LSTM) networks to model sequential dependencies across frames, with the LSTM hidden state evolving over time to capture long-range temporal dynamics.
- Combines image frame features with optical flow maps as input to both the pooling and LSTM models to explicitly encode motion information.
- Trains models by progressively expanding smaller networks and fine-tuning, enabling end-to-end learning on full-length videos without requiring short clips.
- Applies backpropagation through time in the top-level LSTM layers, but not through the CNN layers, limiting gradient flow to the recurrent component.
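The frame sampling, temporal pooling, and recurrent steps listed above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names (`sample_frames`, `max_pool_over_time`, `lstm_step`, `run_lstm`), the tiny feature and hidden sizes, the random weights, and the random vectors standing in for CNN outputs are all assumptions made for illustration. The LSTM here is a single textbook cell that returns only its final hidden state, which may differ from the configuration used in the paper.

```python
import math
import random


def sample_frames(num_frames, native_fps, target_fps=1.0, max_frames=120):
    """Indices of the frames kept when subsampling to target_fps, capped at max_frames."""
    step = max(1, round(native_fps / target_fps))
    return list(range(0, num_frames, step))[:max_frames]


def max_pool_over_time(frame_features):
    """Element-wise max across frames: T vectors of length D -> one length-D descriptor."""
    return [max(column) for column in zip(*frame_features)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, weights):
    """One standard LSTM step; weights maps a gate name to (W_x, W_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = weights[name]
        return [
            squash(sum(w * xi for w, xi in zip(w_x[j], x))
                   + sum(w * hi for w, hi in zip(w_h[j], h))
                   + b[j])
            for j in range(len(h))
        ]

    i = gate("input", sigmoid)
    f = gate("forget", sigmoid)
    o = gate("output", sigmoid)
    g = gate("candidate", math.tanh)
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def random_weights(input_dim, hidden_dim, rng):
    """Hypothetical small random weights; a real model would learn these."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        name: (matrix(hidden_dim, input_dim), matrix(hidden_dim, hidden_dim), [0.0] * hidden_dim)
        for name in ("input", "forget", "output", "candidate")
    }


def run_lstm(frame_features, hidden_dim, weights):
    """Feed the frames in order and return the final hidden state."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    for x in frame_features:
        h, c = lstm_step(x, h, c, weights)
    return h


rng = random.Random(0)
indices = sample_frames(num_frames=3000, native_fps=25)  # a 120 s clip at an assumed 25 fps
features = [[rng.random() for _ in range(8)] for _ in indices]  # stand-in for per-frame CNN features
video_descriptor = max_pool_over_time(features)  # order-independent summary
final_hidden = run_lstm(features, hidden_dim=4, weights=random_weights(8, 4, rng))  # order-dependent summary
```

The contrast between the two summaries is the point of the sketch: max-pooling gives the same descriptor for any ordering of the frames, while the LSTM state depends on the order in which frames arrive.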
Experimental results
Research questions
- RQ1: Can deep neural networks trained on full-length videos (up to 120 frames) significantly improve video classification accuracy compared to models using only short video snippets?
- RQ2: Does the use of optical flow as explicit motion encoding enhance performance, especially when combined with recurrent architectures like LSTMs?
- RQ3: How does reducing the frame rate to 1 fps affect classification performance when optical flow is used to preserve motion information?
- RQ4: Do recurrent models (LSTM) outperform simple temporal pooling methods in capturing long-range temporal dependencies in video sequences?
- RQ5: Is the benefit of optical flow dependent on video quality, and does it remain effective in noisy or untrimmed videos such as those in the Sports-1M dataset?
Key findings
- The proposed LSTM-based model achieves 88.6% accuracy on UCF-101, narrowly surpassing the previous state of the art of 88.0%, which was obtained with two-stream fusion using an SVM.
- The model using 120 frames and optical flow achieves 88.2% accuracy on UCF-101, significantly outperforming the 73.0% accuracy of a single-frame CNN baseline.
- On the Sports-1M dataset, the LSTM model with optical flow achieves 73.1% accuracy, a substantial improvement over the prior state-of-the-art of 60.9%.
- Optical flow provides a larger performance gain on UCF-101 (from 82.6% without flow to 88.2% with it) than on Sports-1M, because UCF-101 videos are of better quality and contain more consistent action content.
- Even with noisy optical flow maps, the LSTM model can still benefit from motion information, demonstrating robustness to low-quality motion features.
- Lower frame rates (1 fps) do not degrade performance when combined with optical flow, as long as sufficient temporal context is preserved.
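To make the size of these improvements concrete, the short sketch below computes the absolute gains in percentage points from the accuracies quoted in this review. The dictionary labels and the helper name `gain_in_points` are illustrative choices, and no figures beyond those quoted above are used.

```python
# Accuracies (%) quoted in this review: (reported result, comparison figure).
results = {
    "UCF-101 vs. previous best": (88.6, 88.0),
    "UCF-101 with vs. without optical flow": (88.2, 82.6),
    "Sports-1M vs. previous best": (73.1, 60.9),
}


def gain_in_points(new, old):
    """Absolute difference in percentage points, rounded to one decimal place."""
    return round(new - old, 1)


gains = {name: gain_in_points(new, old) for name, (new, old) in results.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

Seen this way, the UCF-101 margin over the previous best is small, while the Sports-1M margin and the contribution of optical flow on UCF-101 are much larger.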
This review was created by AI and reviewed by human editors.