[Paper Review] Learning Video Representations using Contrastive Bidirectional Transformer
The paper introduces Contrastive Bidirectional Transformer (CBT) to learn self-supervised video representations from sequences of real-valued frame features, with optional cross-modal training from ASR text, achieving state-of-the-art results on video classification, captioning, and segmentation.
This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods. Our method extends the BERT model for text sequences to the case of sequences of real-valued feature vectors, by replacing the softmax loss with noise contrastive estimation (NCE). We also show how to learn representations from sequences of visual features and sequences of words derived from ASR (automatic speech recognition), and show that such cross-modal training (when possible) helps even more.
Motivation & Objective
- Learn robust video representations without labels for downstream tasks such as classification, captioning, and segmentation.
- Adapt BERT-style bidirectional context modeling to sequences of real-valued video features using contrastive loss.
- Explore cross-modal training by jointly leveraging ASR-derived tokens to maximize mutual information with video features.
- Demonstrate improvements over prior self-supervised methods on standard benchmarks (e.g., UCF101, HMDB51) and on tasks that need longer-term temporal representations.
Proposed method
- Extend BERT-style pretraining to sequences of real-valued video features using a noise-contrastive estimation (NCE) objective.
- Encode short windows of frames with an S3D CNN to produce frame-level features, then apply a bidirectional transformer as the context predictor.
- Use NCE to maximize the predictability of a masked frame feature given its context, encouraging bidirectional temporal representations.
- Introduce a cross-modal transformer to maximize mutual information between video features and optional ASR text tokens, aggregating at the sequence level rather than frame-level alignment.
- Combine three losses in a unified objective: L_cbt = w_bert L_bert + w_visual L_visual + w_cross L_cross. In practice w_bert is fixed at 0 because the text BERT model is pretrained and kept frozen, w_visual = 1, and w_cross is 1 with cross-modal training and 0 without it.
- Evaluate with visual-only CBT pretraining on Kinetics and HowTo100M, followed by linear probing or fine-tuning on downstream tasks like action recognition, captioning, and segmentation.
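The masked-frame NCE loss and the weighted objective described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the score between a predicted and a target feature is a plain dot product, that the other frames in the batch serve as negatives, and that the features are already given as NumPy arrays. The function names `nce_loss` and `cbt_objective` are hypothetical.

```python
import numpy as np


def nce_loss(predicted, targets):
    """InfoNCE-style loss (illustrative sketch).

    Row i of `predicted` (context model output for a masked frame) should
    score highest against row i of `targets` (the true frame feature);
    all other rows of `targets` act as negatives. Assumes dot-product scores.
    """
    logits = predicted @ targets.T                       # (n, n) similarity scores
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # mean over positive pairs


def cbt_objective(l_bert, l_visual, l_cross,
                  w_bert=0.0, w_visual=1.0, w_cross=1.0):
    """Weighted sum of the three losses; defaults follow the weights quoted above."""
    return w_bert * l_bert + w_visual * l_visual + w_cross * l_cross


# Hypothetical usage with random stand-in features (8 masked frames, 16 dims).
rng = np.random.default_rng(0)
frame_features = rng.normal(size=(8, 16))
predictions = frame_features + 0.1 * rng.normal(size=(8, 16))
l_visual = nce_loss(predictions, frame_features)
total = cbt_objective(l_bert=0.0, l_visual=l_visual, l_cross=0.0, w_cross=0.0)
```

Setting `w_cross=0.0` corresponds to the visual-only setting; a cross-modal term would apply the same kind of contrastive loss to sequence-level video and text embeddings rather than to individual frames.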
Experimental results
Research questions
- RQ1: How effectively can a BERT-style bidirectional transformer be trained on sequences of real-valued video features using a contrastive objective?
- RQ2: Does incorporating cross-modal signals from ASR improve learned video representations, especially under imperfect alignment between video and text?
- RQ3: What is the impact of self-supervised CBT pretraining on short-term action recognition versus longer-term temporal representations?
- RQ4: How do the learned representations transfer to downstream tasks such as video classification, segmentation, and captioning compared to prior self-supervised methods?
Key findings
- CBT-based self-supervised learning substantially improves action recognition on UCF101 and HMDB51 compared to prior methods when fine-tuned (e.g., UCF101 79.5 vs. 75.3 and HMDB51 44.5 vs. 40.0 with similar baselines).
- Cross-modal pretraining with ASR signals yields further gains on smaller datasets for action anticipation tasks, and improves temporal representations learned from HowTo100M.
- CBT outperforms prior self-supervised methods by leveraging a transformer-based context model over sequences of real-valued frame features, avoiding vector quantization that can lose fine-grained information.
- Temporal representations learned via CBT scale to longer sequences, showing superior performance over baselines like average pooling and LSTM as video length increases.
- For captioning, CBT-based representations yield higher language metrics (e.g., BLEU-4, METEOR, ROUGE-L, CIDEr); for segmentation, they give competitive frame-labeling performance. These evaluations use the COIN and YouCook2 datasets.
- Compared to VideoBERT and other approaches, CBT achieves strong results without requiring discrete visual tokens, benefiting from direct real-valued feature modeling and cross-modal mutual information.
This review was created by AI and reviewed by human editors.