QUICK REVIEW

[Paper Review] Video Swin Transformer

Ze Liu, Ning Jia|arXiv (Cornell University)|Jun 24, 2021

Human Pose and Action Recognition|References: 38|Citations: 46

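One-Line Summary

This paper introduces Video Swin Transformer, a pure-Transformer backbone for video recognition. It exploits spatiotemporal locality through 3D shifted window attention and achieves state-of-the-art results on major video benchmarks, while being more efficient than prior video Transformers and remaining compatible with image pre-training.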

ABSTRACT

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including on action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2). The code and models will be made publicly available at https://github.com/SwinTransformer/Video-Swin-Transformer.

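Research Motivation and Objectives

Motivate a video Transformer backbone with a locality inductive bias in order to improve the speed-accuracy trade-off.
Carry the inductive biases of Swin Transformer (locality, hierarchy, translation invariance) over to the spatiotemporal setting.
Show that a locality-based video Transformer can outperform global self-attention models with less computation and less pre-training data.
Demonstrate state-of-the-art performance on multiple video recognition benchmarks using pre-trained image models.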

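Proposed Method

Extend non-overlapping local attention from 2D windows to 3D windows so that Swin Transformer can be applied to video.
Implement multi-head self-attention on regular and shifted 3D windows (3DW-MSA and 3DSW-MSA), which keeps attention efficient while introducing connections across windows.
Incorporate a 3D relative position bias into the attention computed inside each local 3D window.
Keep the hierarchical architecture with patch merging, which downsamples spatially but not along the time axis, to produce multi-scale video representations.
Explore initialization from ImageNet pre-trained models and analyze the ratio between backbone and head learning rates to improve generalization.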
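
The window partitioning and shifting described above can be illustrated with a small sketch. It is not the authors' implementation: the function name `partition_3d`, the 8x8x8 token grid, the 4x4x4 window, and the (2, 2, 2) shift are illustrative assumptions, and the attention mask that a real shifted-window layer needs for tokens wrapped around by the cyclic shift is omitted. The sketch only shows which tokens end up sharing a window.

```python
from collections import defaultdict
from itertools import product


def partition_3d(size, window, shift=(0, 0, 0)):
    """Group token coordinates (t, h, w) into non-overlapping 3D windows.

    `size` is the token grid (T, H, W) and `window` is the window size.
    A non-zero `shift` cyclically rolls the grid before partitioning, so the
    number of windows is the same as in the unshifted layout.
    (Hypothetical helper for illustration only.)
    """
    if any(s % w for s, w in zip(size, window)):
        raise ValueError("token grid must be divisible by the window size")
    T, H, W = size
    P, Mh, Mw = window
    windows = defaultdict(list)
    for t, h, w in product(range(T), range(H), range(W)):
        # Coordinates of this token after rolling the grid by -shift.
        rt = (t - shift[0]) % T
        rh = (h - shift[1]) % H
        rw = (w - shift[2]) % W
        windows[(rt // P, rh // Mh, rw // Mw)].append((t, h, w))
    return dict(windows)


# Regular layout followed by a layout shifted by half a window per axis.
regular = partition_3d((8, 8, 8), (4, 4, 4))
shifted = partition_3d((8, 8, 8), (4, 4, 4), shift=(2, 2, 2))

print(len(regular), len(shifted))   # number of windows in each layout
print(len(shifted[(0, 0, 0)]))      # tokens attended together in one window
```

In the regular layout the tokens (2, 2, 2) and (5, 5, 5) fall into different windows, while in the shifted layout they share one. Alternating the two layouts is what lets information cross window boundaries without computing global attention.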

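Experimental Results

Research Questions

RQ1: Can spatiotemporal locality via 3D shifted window attention efficiently approximate global self-attention in video Transformers?
RQ2: How does Video Swin Transformer perform on action recognition and temporal modeling benchmarks compared with state-of-the-art methods?
RQ3: Which initialization and optimization strategies make the best use of pre-trained image models for a video backbone?
RQ4: What do ablations reveal about the effect of the temporal dimension, window design, and learning-rate scheduling on performance?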

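Key Findings

Achieves 84.9% top-1 on Kinetics-400 and 86.1% top-1 on Kinetics-600 with ~20x less pre-training data and a ~3x smaller model than ViViT-H.
Achieves 69.6% top-1 on Something-Something v2, indicating strong temporal modeling.
Joint spatiotemporal attention (3DW-MSA) gives the best speed-accuracy trade-off among the ablated designs (joint vs. split vs. factorized).
The 3D shifted window strategy, which connects neighboring windows, and the relative position bias each contribute to accuracy.
Pre-training on ImageNet-21K and a carefully scaled backbone learning rate improve generalization and efficiency.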
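
The efficiency side of these findings can be made concrete by counting query-key pairs. The sketch below is a rough back-of-the-envelope comparison, not a measurement from the paper: the 16x56x56 token grid and the 8x7x7 window are assumed example values that do not appear in this review, the helper names are hypothetical, and projections, MLP layers, and attention heads are ignored.

```python
from math import prod


def attention_pairs_global(grid):
    """Query-key pairs when every token attends to every token."""
    n = prod(grid)
    return n * n


def attention_pairs_windowed(grid, window):
    """Query-key pairs when each token attends only within its 3D window."""
    if any(g % w for g, w in zip(grid, window)):
        raise ValueError("token grid must be divisible by the window size")
    return prod(grid) * prod(window)


grid = (16, 56, 56)   # assumed token grid (T, H, W), for illustration
window = (8, 7, 7)    # assumed 3D window size (P, M, M), for illustration

global_pairs = attention_pairs_global(grid)
window_pairs = attention_pairs_windowed(grid, window)

print(global_pairs // window_pairs)  # how many times more pairs global attention needs
```

Under these assumptions the ratio equals the number of windows, because windowed attention grows linearly with the token count while global attention grows quadratically.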

This review was generated by AI and checked by a human editor.