[Paper Review] Memory-augmented Dense Predictive Coding for Video Representation Learning
MemDPC introduces a memory-augmented predictive coding framework for self-supervised video representation learning, enabling multiple future hypotheses via a compressive memory and predictive attention, and achieves state-of-the-art or competitive results on action recognition, retrieval, data-scarce learning, and unintentional action detection using only visual input.
The objective of this paper is self-supervised learning from video, in particular learning representations for action recognition. We make the following contributions: (i) We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for the task. It is trained with a predictive attention mechanism over a set of compressed memories, such that any future state can always be constructed as a convex combination of the condensed representations, allowing multiple hypotheses to be made efficiently. (ii) We investigate visual-only self-supervised video representation learning from RGB frames, from unsupervised optical flow, or from both. (iii) We thoroughly evaluate the quality of the learnt representations on four different downstream tasks: action recognition, video retrieval, learning with scarce annotations, and unintentional action classification. In all cases, we demonstrate state-of-the-art or comparable performance relative to other approaches, with orders of magnitude less training data.
Motivation & Objective
- Motivate self-supervised video representation learning using only the visual stream.
- Propose MemDPC, a memory-augmented dense predictive coding framework with a compressive memory for multi-hypothesis future prediction.
- Evaluate MemDPC across action recognition, retrieval, data-scarce learning, and unintentional action detection to establish state-of-the-art or competitive results.
Proposed method
- Partition video into blocks and extract per-block embeddings with a shared encoder f(.) to obtain z_i.
- Aggregate block embeddings with a temporal model g(.) to form a context c_t summarizing past information.
- Introduce a Compressive Memory M = {m_i} to enable multi-hypothesis future prediction via a predictive addressing mechanism p_{t+1} = softmax(φ(c_t)), where φ(.) is an MLP.
- Predict the future block representation ŷ_{t+1} as a convex combination of memory slots: ŷ_{t+1} = p_{t+1} M, where the weights p_{t+1} are non-negative and sum to 1.
- Train with a dense contrastive predictive loss that makes the similarity of each aligned pair (ŷ_{i,k}, z_{i,k}) of predicted and actual future block features higher than the similarity of negative pairs drawn from across the batch and other space-time locations.
- Optionally extend MemDPC with two-stream inputs (RGB and optical flow) and bidirectional aggregation for improved representations.
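The addressing and loss steps above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function names (`predict_future`, `contrastive_loss`), the two-layer MLP weights `w1` and `w2`, the toy sizes, and the use of plain dot-product similarity are all assumptions made for the example, and the dense space-time grid is flattened into a simple list of N vectors.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def predict_future(context, memory, w1, w2):
    """Predict future features as a convex combination of memory slots.

    context: (B, C) context vectors c_t
    memory:  (K, D) memory slots M
    w1, w2:  weights of a hypothetical two-layer MLP standing in for phi
    """
    hidden = np.maximum(context @ w1, 0.0)  # ReLU hidden layer
    p = softmax(hidden @ w2, axis=-1)       # (B, K) addressing weights
    return p @ memory, p                    # (B, D) predictions, weights


def contrastive_loss(pred, target):
    """InfoNCE-style loss: row i of pred should match row i of target.

    All other rows of target act as negatives (assumed dot-product similarity).
    """
    logits = pred @ target.T                               # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # stabilise exp
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                     # positives on diagonal


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
B, C, H, K, D = 4, 16, 32, 8, 16
context = rng.normal(size=(B, C))
memory = rng.normal(size=(K, D))
w1 = rng.normal(size=(C, H)) * 0.1
w2 = rng.normal(size=(H, K)) * 0.1

pred, p = predict_future(context, memory, w1, w2)
targets = rng.normal(size=(B, D))  # stand-in for the encoded future blocks z
loss = contrastive_loss(pred, targets)
```

Because each row of `p` is a softmax output, every prediction lies inside the convex hull of the memory slots, which is what lets a fixed memory bank represent several plausible futures by shifting weight between slots.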
Experimental results
Research questions
- RQ1: Can a memory-augmented predictive framework handle the inherent multi-hypothesis nature of future video frames in a self-supervised setting?
- RQ2: Does incorporating a compressive external memory improve predictive coding and downstream task performance relative to standard DPC?
- RQ3: What is the impact of using RGB, optical flow, or both on learned representations for downstream tasks?
- RQ4: How effective are linear vs. non-linear probes and end-to-end fine-tuning when evaluating self-supervised video representations?
- RQ5: How does MemDPC perform on action recognition, video retrieval, low-data learning, and unintentional action classification?
Key findings
- MemDPC with compressive memory consistently outperforms or matches state-of-the-art self-supervised methods on several benchmarks using visual inputs only.
- A memory size of 1024 often yields best UCF101 results in ablations.
- Bidirectional aggregation and two-stream extensions (RGB+Flow) provide additional gains, with notable improvements on flow-based retrieval and action recognition.
- On K400 pretraining, MemDPC achieves competitive UCF101 and HMDB51 accuracy under linear, non-linear, and full finetuning protocols, often surpassing methods using larger datasets or multi-modal inputs.
- MemDPC demonstrates strong data efficiency, with representations enabling substantial improvements when labeled data is scarce.
- In video retrieval, MemDPC with Flow dramatically improves R@k scores, and RGB+Flow fusion achieves leading performance among visual-only self-supervised methods.
- On unintentional action classification (Oops dataset), MemDPC achieves state-of-the-art results, even with compact backbones and self-supervised pretraining.
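For readers unfamiliar with the R@k retrieval metric mentioned above, the following is a minimal sketch of how it is commonly computed, assuming nearest-neighbour retrieval with cosine similarity in which a query counts as a hit if any of its top-k neighbours shares its class label. The function name `recall_at_k` and the toy features are hypothetical, and the exact protocol used in the paper may differ.

```python
import numpy as np


def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, ks=(1, 5)):
    """Fraction of queries with at least one same-class item in the top k."""
    # L2-normalise so the dot product equals cosine similarity.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sim = q @ g.T                             # (Q, G) similarities
    order = np.argsort(-sim, axis=1)          # most similar first
    ranked_labels = gallery_labels[order]     # (Q, G) labels in ranked order
    hits = ranked_labels == query_labels[:, None]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}


# Toy example: three gallery clips and two query clips, 2-D features.
gallery_feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
gallery_labels = np.array([0, 1, 1])
query_feats = np.array([[1.0, 0.05], [0.1, 1.0]])
query_labels = np.array([1, 1])

scores = recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, ks=(1, 2))
print(scores)  # {1: 0.5, 2: 1.0}
```

In the toy example the first query's nearest neighbour has the wrong class but its second neighbour is correct, so it counts as a miss at k=1 and a hit at k=2.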
This review was created by AI and reviewed by human editors.