[Paper Review] What have we learned from deep representations for action recognition?
This paper introduces spatiotemporally regularized activation maximization to visualize deep two-stream video action recognition models, revealing that they learn distributed, class-specific spatiotemporal features combining appearance and motion. The key contribution is the first visualization of hierarchical motion representations, showing that cross-stream fusion enables true spatiotemporal feature learning and that the visualizations expose both model strengths and dataset biases.
As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing what two-stream models have learned in order to recognize actions in video. We show that local detectors for appearance and motion objects arise to form distributed representations for recognizing human actions. Key observations include the following. First, cross-stream fusion enables the learning of true spatiotemporal features rather than simply separate appearance and motion features. Second, the networks can learn local representations that are highly class specific, but also generic representations that can serve a range of classes. Third, throughout the hierarchy of the network, features become more abstract and show increasing invariance to aspects of the data that are unimportant to desired distinctions (e.g. motion patterns across various speeds). Fourth, visualizations can be used not only to shed light on learned representations, but also to reveal idiosyncrasies of training data and to explain failure cases of the system.
Motivation & Objective
- To understand what deep spatiotemporal representations in video action recognition models actually learn, as their compositional structure makes internal reasoning difficult.
- To develop a method for visualizing internal features without relying on specific input samples, avoiding bias from training data.
- To investigate how appearance and motion pathways interact in two-stream networks and whether fusion leads to true spatiotemporal features.
- To use visualizations to diagnose model failures and uncover hidden dataset biases in benchmark datasets like UCF101.
Proposed method
- Proposes spatiotemporally regularized activation maximization: gradients are backpropagated to the input to find stimuli that maximize a chosen unit's activation.
- Applies gradient ascent to optimize synthetic inputs (from white noise) that maximize filter responses in both appearance and motion branches of a two-stream network.
- Uses regularization to enforce spatiotemporal consistency, ensuring visualizations reflect plausible video-like patterns rather than artifacts.
- Visualizes features across multiple layers of the VGG-16 Two-Stream Fusion model to analyze hierarchical abstraction and invariance.
- Compares visualizations across different temporal regularization levels (χ) to assess robustness to motion speed and pattern variation.
- Analyzes class prediction units by maximizing their output to reveal what features drive specific action classifications.
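The gradient-ascent loop described in the list above can be illustrated with a toy sketch. This is not the paper's implementation: the "unit" is a hypothetical linear filter rather than a unit of the VGG-16 two-stream model, the regularizers are simple quadratic penalties chosen for illustration, and all names (`unit_activation`, `maximize_activation`, `l2_weight`, `temporal_weight`) are assumptions. Here `temporal_weight` plays the role of the temporal regularization level that the review denotes χ.

```python
import random

def unit_activation(video, weights):
    """Toy stand-in for a network unit: a linear filter summed over all frames."""
    return sum(w * p for frame in video for w, p in zip(weights, frame))

def temporal_variation(video):
    """Sum of squared differences between consecutive frames."""
    return sum((b - a) ** 2
               for prev, nxt in zip(video, video[1:])
               for a, b in zip(prev, nxt))

def make_noise_video(n_frames, width, seed=0, scale=0.1):
    """White-noise starting point: n_frames rows of `width` pixel values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(n_frames)]

def maximize_activation(weights, video, steps=20, lr=0.1,
                        l2_weight=0.5, temporal_weight=1.0):
    """Gradient ascent on the (assumed) objective
    activation - l2_weight * ||x||^2 - temporal_weight * temporal_variation(x)."""
    x = [frame[:] for frame in video]
    n_frames = len(x)
    for _ in range(steps):
        new_x = []
        for t, frame in enumerate(x):
            new_frame = []
            for i, value in enumerate(frame):
                grad = weights[i]                # d(activation)/dx for the linear unit
                grad -= 2.0 * l2_weight * value  # L2 penalty keeps values bounded
                if t > 0:                        # pull towards the previous frame
                    grad -= 2.0 * temporal_weight * (value - x[t - 1][i])
                if t < n_frames - 1:             # pull towards the next frame
                    grad -= 2.0 * temporal_weight * (value - x[t + 1][i])
                new_frame.append(value + lr * grad)
            new_x.append(new_frame)
        x = new_x
    return x

# Hypothetical filter and a short noise "video" of 8 frames.
weights = [0.0, 1.0, 2.0, 1.0, 0.0]
start = make_noise_video(n_frames=8, width=len(weights))
smooth = maximize_activation(weights, start, temporal_weight=1.0)
free = maximize_activation(weights, start, temporal_weight=0.0)
```

Comparing `temporal_variation(smooth)` with `temporal_variation(free)` shows how a stronger temporal penalty yields a slower-varying stimulus from the same starting noise. In a real setting the analytic gradient would be replaced by backpropagation through the trained network.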
Experimental results
Research questions
- RQ1: What kind of spatiotemporal features do deep two-stream networks learn for action recognition?
- RQ2: Does cross-stream fusion lead to true spatiotemporal representations, or merely separate appearance and motion features?
- RQ3: How do learned features vary in specificity: do they capture class-specific patterns or generic motion/appearance cues?
- RQ4: To what extent do visualizations reveal dataset biases or failure modes in action recognition models?
- RQ5: Can visualizations expose subtle differences between confusing action classes, such as PlayingCello vs. PlayingViolin?
Key findings
- Cross-stream fusion enables the learning of true spatiotemporal features, such as a filter activated by colored blobs in appearance and moving circular regions in motion, which together support recognition of actions like Billiards.
- The network learns both highly class-specific features (e.g. barbells and body motion for CleanAndJerk) and generic representations (e.g. limbs and motion patterns) that generalize across classes.
- As features progress through the network hierarchy, they become more abstract and invariant to irrelevant variations, such as motion speed, indicating progressive abstraction.
- Visualizations reveal that confusion between PlayingCello and PlayingViolin arises because the model focuses on instrument alignment (horizontal vs. vertical), not fine details like bowing technique.
- Confusion between BrushingTeeth and ShavingBeard stems from shared local motion and appearance of tools near the face, with the model failing to distinguish subtle differences in tool motion and facial structure.
- The model distinguishes ApplyEyeMakeup and ApplyLipstick partly by detecting eye movement in the latter, revealing a dataset peculiarity where eyes are often static in the former class.
This review was created by AI and reviewed by human editors.