QUICK REVIEW

[Paper Review] Memory-augmented Dense Predictive Coding for Video Representation Learning

Tengda Han, Weidi Xie | arXiv (Cornell University) | Aug 3, 2020

Human Pose and Action Recognition | 60 references | 81 citations

TL;DR

MemDPC introduces a memory-augmented predictive coding framework for self-supervised video representation learning, enabling multiple future hypotheses via a compressive memory and predictive attention, and achieves state-of-the-art or competitive results on action recognition, retrieval, data-scarce learning, and unintentional action detection using only visual input.

ABSTRACT

The objective of this paper is self-supervised learning from video, in particular for representations for action recognition. We make the following contributions: (i) We propose a new architecture and learning framework Memory-augmented Dense Predictive Coding (MemDPC) for the task. It is trained with a predictive attention mechanism over the set of compressed memories, such that any future states can always be constructed by a convex combination of the condensed representations, allowing it to make multiple hypotheses efficiently. (ii) We investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or both. (iii) We thoroughly evaluate the quality of learnt representation on four different downstream tasks: action recognition, video retrieval, learning with scarce annotations, and unintentional action classification. In all cases, we demonstrate state-of-the-art or comparable performance over other approaches with orders of magnitude fewer training data.

Motivation & Objective

Motivate self-supervised video representation learning using only the visual stream.
Propose MemDPC, a memory-augmented dense predictive coding framework with a compressive memory for multi-hypothesis future prediction.
Evaluate MemDPC across action recognition, retrieval, data-scarce learning, and unintentional action detection to establish state-of-the-art or competitive results.

Proposed method

Partition video into blocks and extract per-block embeddings with a shared encoder f(.) to obtain z_i.
Aggregate block embeddings with a temporal model g(.) to form a context c_t summarizing past information.
Introduce a compressive memory M = {m_i} and a predictive addressing mechanism p_{t+1} = softmax(φ(c_t)), where φ(.) is an MLP, to enable multi-hypothesis future prediction.
Predict the future block representation ŷ_{t+1} as a convex combination of memory slots: ŷ_{t+1} = p_{t+1} M = Σ_i p_{i,t+1} m_i.
Train with a dense contrastive predictive loss that pushes the similarity of each aligned pair (ŷ_{i,k}, z_{i,k}), where i indexes the future time step and k the spatial location, above that of negatives drawn from across the batch and from other space-time locations.
Optionally extend MemDPC with two-stream inputs (RGB and optical flow) and bidirectional aggregation for improved representations.
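The addressing and loss steps listed above can be illustrated with a small sketch. It is not the authors' implementation: it assumes φ is a single linear layer (the review describes an MLP), uses dot-product similarity, and fills the memory, context, and candidate features with random toy values. All function and variable names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict_from_memory(context, phi_weights, memory):
    """Return (p, y_hat) with p = softmax(phi(c_t)) and y_hat = p M.

    Assumption: phi is one linear layer with one weight row per memory slot.
    """
    logits = [dot(row, context) for row in phi_weights]
    p = softmax(logits)  # non-negative weights that sum to 1
    feat_dim = len(memory[0])
    # Convex combination of the memory slots, computed per feature dimension.
    y_hat = [
        sum(p[i] * memory[i][d] for i in range(len(memory)))
        for d in range(feat_dim)
    ]
    return p, y_hat


def contrastive_loss(y_hat, candidates, positive_index):
    """InfoNCE-style loss: negative log-probability of the positive candidate."""
    sims = [dot(y_hat, z) for z in candidates]
    probs = softmax(sims)
    return -math.log(probs[positive_index])


# Toy setup (all sizes and values are illustrative assumptions).
random.seed(0)
num_slots, ctx_dim, feat_dim = 4, 3, 5
memory = [[random.gauss(0, 1) for _ in range(feat_dim)] for _ in range(num_slots)]
phi_weights = [[random.gauss(0, 1) for _ in range(ctx_dim)] for _ in range(num_slots)]
context = [random.gauss(0, 1) for _ in range(ctx_dim)]

p, y_hat = predict_from_memory(context, phi_weights, memory)

# Candidate 0 plays the role of the aligned future feature z; the rest are negatives.
candidates = [[random.gauss(0, 1) for _ in range(feat_dim)] for _ in range(6)]
loss = contrastive_loss(y_hat, candidates, positive_index=0)
```

Because p comes from a softmax, every predicted feature lies inside the range spanned by the memory slots. Different contexts select different mixtures of the same slots, which is how the design can express several plausible futures.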

Experimental results

Research questions

RQ1: Can a memory-augmented predictive framework handle the inherent multi-hypothesis nature of future video frames in a self-supervised setting?
RQ2: Does incorporating a compressive external memory improve predictive coding and downstream task performance relative to standard DPC?
RQ3: What is the impact of using RGB, optical flow, or both on learned representations for downstream tasks?
RQ4: How effective are linear vs. non-linear probes and end-to-end fine-tuning when evaluating self-supervised video representations?
RQ5: How does MemDPC perform on action recognition, video retrieval, low-data learning, and unintentional action classification?

Key findings

MemDPC with compressive memory consistently outperforms or matches state-of-the-art self-supervised methods on several benchmarks using visual inputs only.
In the ablations, a memory size of 1024 often yields the best UCF101 results.
Bidirectional aggregation and two-stream extensions (RGB+Flow) provide additional gains, with notable improvements on flow-based retrieval and action recognition.
With K400 pretraining, MemDPC achieves competitive UCF101 and HMDB51 accuracy under linear, non-linear, and full fine-tuning protocols, often surpassing methods using larger datasets or multi-modal inputs.
MemDPC demonstrates strong data efficiency, with representations enabling substantial improvements when labeled data is scarce.
In video retrieval, MemDPC with Flow dramatically improves R@k scores, and RGB+Flow fusion achieves leading performance among visual-only self-supervised methods.
On unintentional action classification (Oops dataset), MemDPC achieves state-of-the-art results, even with compact backbones and self-supervised pretraining.
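For readers unfamiliar with the R@k retrieval metric mentioned above, the sketch below shows one common way to compute it. It assumes cosine similarity and counts a query as a hit if any of its k nearest gallery items shares its class label. The exact protocol used in the paper is not given in this review, and the function names and toy data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return num / (norm_a * norm_b)


def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k):
    """Fraction of queries with a same-class item among their top-k neighbours."""
    hits = 0
    for feat, label in zip(query_feats, query_labels):
        # Rank gallery items from most to least similar to the query.
        ranked = sorted(
            range(len(gallery_feats)),
            key=lambda i: cosine(feat, gallery_feats[i]),
            reverse=True,
        )
        if any(gallery_labels[i] == label for i in ranked[:k]):
            hits += 1
    return hits / len(query_feats)


# Toy features and labels (illustrative only).
gallery_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
gallery_labels = ["run", "jump", "jump"]
query_feats = [[0.9, 0.1], [0.6, 0.8]]
query_labels = ["run", "run"]

r_at_1 = recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k=1)
r_at_3 = recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k=3)
```

In this toy case the first query finds its class immediately, while the second only does so at rank 3, so R@1 is 0.5 and R@3 is 1.0.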


This review was created by AI and reviewed by human editors.