QUICK REVIEW

[Paper Review] SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

Kevin Lin, Linjie Li | arXiv (Cornell University) | Nov 25, 2021
Multimodal Machine Learning Applications | 31 citations
TL;DR

SwinBERT is an end-to-end pure Transformer model for video captioning that processes raw video frames with a Video Swin Transformer and a multimodal Transformer encoder, augmented by a learnable sparse attention mask to improve long-range video sequence modeling.

ABSTRACT

The canonical approach to video captioning dictates a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaption to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs, and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input without dedicated design for different frame rates. Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames as opposed to previous successes with sparsely sampled video frames for video-and-language understanding tasks (e.g., video question answering). Moreover, to avoid the inherent redundancy in consecutive video frames, we propose adaptively learning a sparse attention mask and optimizing it for task-specific performance improvement through better long-range video sequence modeling. Through extensive experiments on 5 video captioning datasets, we show that SwinBERT achieves across-the-board performance improvements over previous methods, often by a large margin. The learned sparse attention masks in addition push the limit to new state of the arts, and can be transferred between different video lengths and between different datasets. Code is available at https://github.com/microsoft/SwinBERT

Motivation & Objective

  • Motivate end-to-end video captioning without fixed-frame-rate feature extractors.
  • Propose a Video Swin Transformer encoder to handle variable-length video input from raw frames.
  • Introduce a learnable sparse attention mask to regularize long-range video sequence modeling.
  • Demonstrate substantial CIDEr improvements over prior state-of-the-art on multiple benchmarks.

Proposed method

  • Use Video Swin Transformer (VidSwin) to convert raw frames into video tokens.
  • Employ a multimodal Transformer encoder to generate captions from video tokens and word tokens.
  • Introduce a learnable sparse attention mask with a sparsity loss to focus on informative video tokens.
  • Train end-to-end with Masked Language Modeling integrated with the sparse attention loss.
  • Experiment with different frame counts to study the effect of dense sampling on captioning performance.
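
The learnable sparse attention mask and its sparsity loss can be illustrated with a small sketch. This is not the authors' implementation: it assumes the mask is a matrix of free parameters squashed by a sigmoid, that the sparsity loss is a weighted L1 penalty on the mask values, and that the mask enters attention as an additive bias before the softmax. The names `soft_mask`, `sparsity_loss`, `masked_softmax`, `weight`, and `penalty` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_mask(logits):
    """Map unconstrained learnable logits to mask values in (0, 1)."""
    return [[sigmoid(v) for v in row] for row in logits]

def sparsity_loss(mask, weight=1.0):
    """Assumed L1 penalty on the mask values: smaller means sparser attention."""
    return weight * sum(abs(v) for row in mask for v in row)

def masked_softmax(scores, mask_row, penalty=-1e4):
    """Softmax over one query's scores, biased so low-mask tokens get little weight."""
    # Assumed additive form: a mask value near 0 adds a large negative bias.
    biased = [s + (1.0 - m) * penalty for s, m in zip(scores, mask_row)]
    peak = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Two video tokens; the logits favour each token attending to itself.
logits = [[4.0, -4.0], [-4.0, 4.0]]
mask = soft_mask(logits)
weights = masked_softmax([0.0, 0.0], mask[0])
loss = sparsity_loss(mask)
```

In training, a term like `loss` would be added to the Masked Language Modeling objective so that the captioning signal and the sparsity pressure jointly shape the mask.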

Experimental results

Research questions

  • RQ1: Can an end-to-end Transformer-based model on raw video frames match or surpass multi-feature approaches for video captioning?
  • RQ2: Does learnable sparse attention improve long-range video sequence modeling for caption generation?
  • RQ3: How does frame density (number of frames) affect captioning performance across datasets?
  • RQ4: Are sparse attention masks transferable across frame rates and datasets?
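
To make the frame-density question (RQ3) concrete, the sketch below shows one simple way to vary the number of input frames: pick evenly spaced frame indices from a clip. The review does not state which sampling scheme the paper uses, so uniform sampling and the helper name `sample_frame_indices` are assumptions for illustration.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly over a clip."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        # The clip is shorter than the request: use every frame once.
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the centre of each of the num_samples equal-length segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

sparse = sample_frame_indices(64, 4)   # sparser sampling of a 64-frame clip
dense = sample_frame_indices(64, 32)   # denser sampling of the same clip
```

Running a captioning model on `sparse` versus `dense` selections of the same clip is the kind of comparison RQ3 asks about.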

Key findings

  • SwinBERT achieves large CIDEr gains over prior SOTA on five datasets (e.g., MSVD, MSRVTT, TVC, VATEX).
  • Increasing the number of input frames (denser sampling) improves CIDEr scores.
  • The proposed sparse attention mask, with a sparsity constraint, improves performance and learns to focus on salient video tokens.
  • Sparse masks are transferable across frame rates and can transfer across datasets with fine-tuning.
  • Binary or soft sparse masks offer comparable performance to full attention with potential runtime benefits.
  • Visualization shows the model prioritizes center-region tokens with more motion while sparsely attending to boundaries.
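
The difference between soft and binary masks mentioned in the findings can be sketched as a thresholding step. The 0.5 threshold and the helper names `binarize` and `kept_fraction` are assumptions, not details taken from this review.

```python
def binarize(mask, threshold=0.5):
    """Turn a soft mask with values in [0, 1] into a hard 0/1 mask."""
    return [[1 if v >= threshold else 0 for v in row] for row in mask]

def kept_fraction(binary_mask):
    """Fraction of token pairs that remain connected after binarization."""
    cells = [v for row in binary_mask for v in row]
    return sum(cells) / len(cells)

# Toy soft mask over three video tokens (values are made up for illustration).
soft = [
    [0.90, 0.10, 0.60],
    [0.20, 0.80, 0.40],
    [0.50, 0.30, 0.95],
]
hard = binarize(soft)
density = kept_fraction(hard)
```

A lower `density` means fewer token pairs need to be attended to, which is where any runtime benefit of a binary mask would come from.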

This review was created by AI and reviewed by human editors.