QUICK REVIEW

[Paper Review] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

Kunchang Li, Yali Wang|arXiv (Cornell University)|Nov 17, 2022

Visual Attention and Saliency Detection | Citations: 58

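One-Sentence Summary

UniFormerV2 combines image-pretrained ViTs with concise UniFormer video designs to learn spatiotemporal representations, achieving state-of-the-art results on 8 video benchmarks and 90.0% top-1 accuracy on Kinetics-400.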

ABSTRACT

Learning discriminative spatiotemporal representation is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependency with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue, by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model requires a tiresome and complicated image-pretraining phase, before being finetuned on videos. This blocks its wide usage in practice. On the contrary, open-sourced ViTs are readily available and well-pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks, by arming the pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block. But it contains brand-new local and global relation aggregators, which allow for preferable accuracy-computation balance by seamlessly integrating advantages from both ViTs and UniFormer. Without any bells and whistles, our UniFormerV2 gets the state-of-the-art recognition performance on 8 popular video benchmarks, including scene-related Kinetics-400/600/700 and Moments in Time, temporal-related Something-Something V1/V2, untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400, to our best knowledge. Code will be available at https://github.com/OpenGVLab/UniFormerV2.

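Motivation and Objectives

Propose a practical paradigm for building strong video models by arming open-source, image-pretrained ViTs with UniFormer-style video blocks.
Design local and global relation aggregators that balance accuracy and computation.
Enable multi-stage fusion to integrate multi-scale spatiotemporal representations.
Validate the approach on diverse benchmarks (Kinetics-400/600/700, Moments in Time, Something-Something V1/V2, ActivityNet, HACS).
Demonstrate effectiveness with a unified post-pretraining benchmark (Kinetics-710).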

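Proposed Method

Introduce a local UniBlock that inserts a local temporal MHRA (Multi-Head Relation Aggregator) before the ViT block, reducing temporal redundancy while reusing the pretrained spatial features.
Add a global UniBlock on top of each local block; it uses a query-based cross MHRA to summarize all tokens into a video token, with complexity linear in the number of tokens.
Adopt a multi-stage fusion block that integrates the global tokens from multiple stages into the final video representation.
Reuse and adapt MHRA from UniFormer, using the local LT_MHRA and the global GS_MHRA for efficient spatiotemporal modeling.
Project the input into spatiotemporal tokens with a 3D convolution, downsample along the temporal dimension, apply the local and global UniBlocks, fuse the multi-stage outputs, and optionally fuse the result with the class token.
Examine four fusion strategies for combining global tokens across stages (Sequential, Parallel, Hierarchical KV, Hierarchical Q).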
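
The query-based aggregation in the global UniBlock can be illustrated with a small sketch. This is a simplified, single-head example in plain Python and not the authors' implementation: the names `softmax` and `query_pool`, the absence of learned projections, and the toy token values are all assumptions made for illustration. The point it shows is that a single query scores each of the N tokens once, so the cost grows linearly with N, whereas full self-attention compares every token with every other token.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_pool(query, tokens):
    """Summarize N tokens into one vector using a single query.

    Hypothetical helper: single head, no key/value/output projections.
    Computes N dot products (linear in N), then a weighted sum of tokens.
    """
    d = len(query)
    scale = 1.0 / math.sqrt(d)
    # One scaled dot product per token: N scores in total.
    scores = [scale * sum(q * k for q, k in zip(query, tok)) for tok in tokens]
    weights = softmax(scores)
    # Weighted average of the tokens gives the pooled "video token".
    pooled = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]
    return pooled, weights


# Toy example: three 2-dimensional tokens and one query (illustrative values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
video_token, weights = query_pool(query, tokens)
```

In this toy run the query favors tokens with a large first component, so the first and third tokens receive more weight than the second. In a trained model the query would be a learned parameter rather than a fixed vector.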

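Experimental Results

Research Questions

RQ1: Can image-pretrained ViTs be combined effectively with UniFormer-style video designs to improve spatiotemporal learning?
RQ2: What is the accuracy-efficiency trade-off of UniFormerV2 compared with prior video models on standard benchmarks?
RQ3: How does multi-stage fusion of global tokens affect the final video representation?
RQ4: Does post-pretraining on the unified Kinetics-710 benchmark yield consistent gains across Kinetics-400/600/700, MiT, and other datasets?
RQ5: Does the proposed cross MHRA global block maintain or improve performance while remaining computationally efficient?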

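Key Findings

Achieves state-of-the-art results on 8 popular video benchmarks: Kinetics-400/600/700, Moments in Time, Something-Something V1/V2, ActivityNet, and HACS.
First model to reach 90.0% top-1 accuracy on Kinetics-400, to the authors' knowledge.
Performs consistently well across datasets, with a favorable trade-off among accuracy, parameters, and FLOPs.
Post-pretraining on Kinetics-710 enables strong transfer with minimal additional fine-tuning (demonstrated on K400/600/700).
Shows that combining image-pretrained ViTs with UniFormer designs yields robust spatiotemporal representations for video tasks without requiring a heavy image-pretraining phase.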

This review was generated by AI and checked by a human editor.