QUICK REVIEW

[Paper Review] Video Swin Transformer

Ze Liu, Ning Jia|arXiv (Cornell University)|Jun 24, 2021

Human Pose and Action Recognition|References: 38|Cited by: 46

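One-Sentence Summary

This paper proposes Video Swin Transformer, a pure-Transformer backbone for video recognition that uses 3D shifted window attention to impose spatiotemporal locality, achieving state-of-the-art results on key video benchmarks while being more efficient than global-attention models and compatible with image pre-training.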

ABSTRACT

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including on action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2). The code and models will be made publicly available at https://github.com/SwinTransformer/Video-Swin-Transformer.

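Motivation and Objectives

Design a Transformer backbone for video with a locality bias to improve the speed-accuracy trade-off.
Carry the inductive biases of Swin Transformer (locality, hierarchy, translation invariance) over to the spatiotemporal setting.
Show that locality-based video Transformers can outperform global-attention models with less computation and less pre-training data.
Demonstrate state-of-the-art performance on multiple video recognition benchmarks using pre-trained image models.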

Proposed Method

Adapt Swin Transformer to video by extending non-overlapping local attention from 2D windows to 3D windows.
Implement multi-head self-attention based on 3D shifted windows (3DSW-MSA), which adds cross-window connections while preserving efficiency.
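Introduce a 3D relative position bias to strengthen attention within each local 3D window.
Keep the hierarchical architecture, with patch merging between stages and no downsampling along the temporal dimension, to obtain multi-scale video representations.
Explore initialization strategies from ImageNet pre-trained models and analyze the learning-rate ratio between backbone and head for better generalization.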
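
The 3D window partitioning and half-window shift described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `window_partition_3d` and `cyclic_shift_3d`, the (T, H, W, C) token layout, and the toy sizes are assumptions made here. A cyclic roll is one common way to realize the shift, and the attention mask that a shifted-window layer needs after such a roll is omitted.

```python
import numpy as np


def window_partition_3d(x, window_size):
    """Split a (T, H, W, C) token grid into non-overlapping 3D windows.

    Returns an array of shape (num_windows, P * Mh * Mw, C), so that
    self-attention can be computed independently inside each window.
    """
    T, H, W, C = x.shape
    P, Mh, Mw = window_size
    # Assumption of this sketch: the grid divides evenly (no padding logic).
    assert T % P == 0 and H % Mh == 0 and W % Mw == 0
    x = x.reshape(T // P, P, H // Mh, Mh, W // Mw, Mw, C)
    # Bring the three "window index" axes to the front.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, P * Mh * Mw, C)


def cyclic_shift_3d(x, window_size):
    """Roll the token grid by half a window along T, H and W.

    Partitioning the rolled grid groups tokens that sat in different
    windows before the shift, which gives cross-window connections.
    """
    shift = tuple(-(s // 2) for s in window_size)
    return np.roll(x, shift=shift, axis=(0, 1, 2))


# Toy example: an 8x8x8 grid of 16-dim tokens with 4x4x4 windows.
tokens = np.random.rand(8, 8, 8, 16)
regular = window_partition_3d(tokens, (4, 4, 4))
shifted = window_partition_3d(cyclic_shift_3d(tokens, (4, 4, 4)), (4, 4, 4))
```

Alternating the regular and shifted partitions in consecutive blocks keeps the per-window attention cost fixed while letting information flow across window boundaries.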

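Experimental Results

Research Questions

RQ1: Can spatiotemporal locality, realized through 3D shifted window attention, efficiently approximate global self-attention in video Transformers?
RQ2: How does Video Swin Transformer compare with state-of-the-art methods on action recognition and temporal modeling benchmarks?
RQ3: Which initialization and optimization strategies make the best use of pre-trained image models when building a video backbone?
RQ4: How do ablations of the temporal dimension, window design, and learning-rate schedule affect performance?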

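Key Findings

Achieves 84.9% top-1 on Kinetics-400 and 86.1% top-1 on Kinetics-600, with about 20x less pre-training data and about 3x smaller model size than ViViT-H.
Achieves 69.6% top-1 on Something-Something v2, demonstrating strong temporal modeling ability.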
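Joint spatiotemporal locality (3DW-MSA) gives the best speed-accuracy trade-off among the ablated designs (joint vs. split vs. factorized).
The 3D shifted window strategy provides cross-window connections and, together with the relative position bias, improves performance.
Pre-training on ImageNet-21K and carefully scaling the backbone learning rate improve generalization and efficiency.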
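
The 3D relative position bias mentioned in the findings can be stored as a small learnable table indexed by the offset between two tokens of a window. For a window of P x M x M tokens, offsets lie in [-P+1, P-1] along time and [-M+1, M-1] along height and width, so (2P-1) x (2M-1) x (2M-1) entries per head are enough. The sketch below only builds the lookup index. The function name `relative_position_index_3d` and the flattening order are assumptions of this example, not taken from the paper's code.

```python
import numpy as np


def relative_position_index_3d(window_size):
    """Map every token pair in a 3D window to a row of a bias table.

    Returns an (N, N) integer array with N = P * Mh * Mw. Entry (i, j)
    is the flat index of the offset (token i minus token j) in a table
    of (2P-1) * (2Mh-1) * (2Mw-1) rows.
    """
    P, Mh, Mw = window_size
    grid = np.meshgrid(np.arange(P), np.arange(Mh), np.arange(Mw), indexing="ij")
    coords = np.stack(grid).reshape(3, -1)         # (3, N) token coordinates
    rel = coords[:, :, None] - coords[:, None, :]  # (3, N, N) signed offsets
    rel = rel.transpose(1, 2, 0).copy()            # (N, N, 3)
    # Shift each offset so that it starts at zero.
    rel[..., 0] += P - 1
    rel[..., 1] += Mh - 1
    rel[..., 2] += Mw - 1
    # Flatten the three non-negative offsets into one table index.
    return (rel[..., 0] * (2 * Mh - 1) + rel[..., 1]) * (2 * Mw - 1) + rel[..., 2]


# Hypothetical usage: look up a per-pair bias from a table for one head.
index = relative_position_index_3d((2, 3, 3))
table = np.zeros((3 * 5 * 5,))  # stand-in for a learnable parameter
bias = table[index]             # (18, 18), added to the attention logits
```

Because the table depends only on relative offsets, token pairs with the same displacement share one bias value regardless of where they sit in the window.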


This review was generated by AI and reviewed by human editors.