QUICK REVIEW

[Paper Review] SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

Kevin Lin, Linjie Li|arXiv (Cornell University)|Nov 25, 2021

Multimodal Machine Learning Applications|Cited by 31

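One-Sentence Summary

SwinBERT is an end-to-end, pure-Transformer model for video captioning. It encodes raw video frames with a Video Swin Transformer, passes the resulting video tokens to a multimodal Transformer encoder, and uses a learnable sparse attention mask to improve long-sequence video modeling.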

ABSTRACT

The canonical approach to video captioning dictates a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaption to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs, and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input without dedicated design for different frame rates. Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames as opposed to previous successes with sparsely sampled video frames for video-and-language understanding tasks (e.g., video question answering). Moreover, to avoid the inherent redundancy in consecutive video frames, we propose adaptively learning a sparse attention mask and optimizing it for task-specific performance improvement through better long-range video sequence modeling. Through extensive experiments on 5 video captioning datasets, we show that SwinBERT achieves across-the-board performance improvements over previous methods, often by a large margin. The learned sparse attention masks in addition push the limit to new state of the arts, and can be transferred between different video lengths and between different datasets. Code is available at https://github.com/microsoft/SwinBERT

Motivation and Objectives

Motivate end-to-end video captioning that does not depend on offline feature extractors operating at a fixed frame rate.
Propose a Video Swin Transformer encoder to handle variable-length video input from raw frames.
Introduce a learnable sparse attention mask to regularize long-range video sequence modeling.
Demonstrate substantial CIDEr improvements over prior state-of-the-art on multiple benchmarks.

Proposed Method

Use Video Swin Transformer (VidSwin) to convert raw frames into video tokens.
Employ a multimodal Transformer encoder to generate captions from video tokens and word tokens.
Introduce a learnable sparse attention mask with a sparsity loss to focus on informative video tokens.
Train end-to-end with a Masked Language Modeling objective combined with the sparsity loss on the attention mask.
Experiment with different frame counts to study the effect of dense sampling on captioning performance.
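
The combined training objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the learnable mask is stored as unconstrained values squashed by a sigmoid, and that the sparsity loss is an L1 penalty scaled by a weight `lam`. The function names, the toy 2x2 mask, and the numeric values are all hypothetical.

```python
import math

def sigmoid(x):
    """Squash an unconstrained value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def soft_mask(logits):
    """Turn learnable parameters into a soft attention mask (assumed sigmoid)."""
    return [[sigmoid(v) for v in row] for row in logits]

def sparsity_loss(mask, lam):
    """Assumed L1 penalty on the mask entries, scaled by the weight lam."""
    return lam * sum(abs(v) for row in mask for v in row)

def total_loss(mlm_loss, mask, lam):
    """Captioning (MLM) loss plus the sparsity regularizer."""
    return mlm_loss + sparsity_loss(mask, lam)

# Toy 2x2 mask over two video tokens; values are illustrative only.
logits = [[0.0, -2.0], [2.0, 0.0]]
mask = soft_mask(logits)
loss = total_loss(mlm_loss=1.5, mask=mask, lam=0.1)
```

Minimizing the second term pushes mask entries toward zero, so only the video-token pairs that help the captioning loss keep a large weight.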

Experimental Results

Research Questions

RQ1: Can an end-to-end Transformer-based model on raw video frames match or surpass multi-feature approaches for video captioning?
RQ2: Does learnable sparse attention improve long-range video sequence modeling for caption generation?
RQ3: How does frame density (number of frames) affect captioning performance across datasets?
RQ4: Are sparse attention masks transferable across frame rates and datasets?
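
To make "frame density" in RQ3 concrete, the sketch below picks a fixed number of evenly spaced frames from a clip, so that a larger `num_samples` means denser sampling. It assumes uniform sampling at segment centers; the function name and the example clip length are hypothetical and not taken from the paper's code.

```python
from typing import List

def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples evenly spaced frame indices from a clip."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        # The clip is shorter than the request: use every frame once.
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the center of each of the num_samples equal-length segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

# A hypothetical 64-frame clip sampled sparsely and densely.
sparse_indices = sample_frame_indices(64, 4)
dense_indices = sample_frame_indices(64, 32)
```

Varying `num_samples` while holding the clip fixed is one way to set up the comparison the question asks about.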

Key Findings

SwinBERT achieves large CIDEr gains over prior SOTA on five datasets (MSVD, YouCook2, MSRVTT, TVC, and VATEX).
Increasing the number of input frames (denser sampling) improves CIDEr scores.
The proposed sparse attention mask, with a sparsity constraint, improves performance and learns to focus on salient video tokens.
Sparse masks are transferable across frame rates and can transfer across datasets with fine-tuning.
Binary or soft sparse masks offer comparable performance to full attention with potential runtime benefits.
Visualization shows the model prioritizes center-region tokens with more motion while sparsely attending to boundaries.
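
The relationship between soft and binary masks mentioned above can be illustrated with a simple thresholding step. This is a sketch under assumptions: the 0.5 threshold, the function names, and the toy 3x3 mask are chosen for illustration and are not confirmed details of the released code.

```python
def binarize(mask, threshold=0.5):
    """Convert a soft mask into a 0/1 mask by thresholding each entry."""
    return [[1 if v > threshold else 0 for v in row] for row in mask]

def kept_fraction(binary_mask):
    """Fraction of token pairs that remain connected after binarization."""
    total = sum(len(row) for row in binary_mask)
    return sum(sum(row) for row in binary_mask) / total

# Toy soft mask over three video tokens; values are illustrative only.
soft = [[0.9, 0.2, 0.6],
        [0.1, 0.8, 0.4],
        [0.7, 0.3, 0.95]]
hard = binarize(soft)
density = kept_fraction(hard)
```

A lower kept fraction means fewer token pairs to attend over, which is where any runtime benefit of a binary mask would come from.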


This review was generated by AI and reviewed by human editors.