QUICK REVIEW

[Paper Review] Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality

Xiang Li, Wenhai Wang|arXiv (Cornell University)|May 20, 2022

Advanced Neural Network Applications | Cited by 36

One-Sentence Summary

Uniform Masking integrates MAE-style self-supervised pre-training with locality-based Pyramid ViTs, making pre-training efficient while keeping fine-tuning performance high across tasks.

ABSTRACT

Masked AutoEncoder (MAE) has recently led the trends of visual self-supervision area by an elegant asymmetric encoder-decoder design, which significantly optimizes both the pre-training efficiency and fine-tuning accuracy. Notably, the success of the asymmetric structure relies on the "global" property of Vanilla Vision Transformer (ViT), whose self-attention mechanism reasons over arbitrary subset of discrete image patches. However, it is still unclear how the advanced Pyramid-based ViTs (e.g., PVT, Swin) can be adopted in MAE pre-training as they commonly introduce operators within "local" windows, making it difficult to handle the random sequence of partial vision tokens. In this paper, we propose Uniform Masking (UM), successfully enabling MAE pre-training for Pyramid-based ViTs with locality (termed "UM-MAE" for short). Specifically, UM includes a Uniform Sampling (US) that strictly samples $1$ random patch from each $2 \times 2$ grid, and a Secondary Masking (SM) which randomly masks a portion of (usually $25\%$) the already sampled regions as learnable tokens. US preserves equivalent elements across multiple non-overlapped local windows, resulting in the smooth support for popular Pyramid-based ViTs; whilst SM is designed for better transferable visual representations since US reduces the difficulty of pixel recovery pre-task that hinders the semantic learning. We demonstrate that UM-MAE significantly improves the pre-training efficiency (e.g., it speeds up and reduces the GPU memory by $\sim 2\times$) of Pyramid-based ViTs, but maintains the competitive fine-tuning performance across downstream tasks. For example using HTC++ detector, the pre-trained Swin-Large backbone self-supervised under UM-MAE only in ImageNet-1K can even outperform the one supervised in ImageNet-22K. The codes are available at https://github.com/implus/UM-MAE.

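Motivation and Objectives

Motivate and enable MAE-style self-supervised pre-training for Pyramid-based ViTs that use local windows.
Design Uniform Masking, which keeps the input structure uniform across local windows while preserving efficiency.
Show that UM-MAE reduces pre-training time and GPU memory while maintaining or improving downstream performance.
Investigate how UM-MAE compares with existing MIM methods on downstream tasks such as ImageNet-1K classification, ADE20K segmentation, and COCO object detection.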

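Proposed Method

Uniform Sampling (US) selects one random patch from each 2×2 grid, producing a visible set of 25% of the patches.
Secondary Masking (SM) randomly masks a portion (e.g., 25%) of the already sampled regions, replacing them with learnable mask tokens.
The uniformly sampled patches are rearranged into a compact 2D input that is fed to the Pyramid-based ViT encoder.
The decoder remains the lightweight ViT from MAE, and the missing patches are reconstructed with a mean squared error loss.
The encoder input is reduced to 25% of the tokens, and pixel shuffle is used to recover the resolution for the Pyramid backbone.
Experiments compare UM-MAE against SimMIM and MAE baselines on IN1K, ADE20K, and COCO; intermediate fine-tuning is also discussed in places.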
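
The sampling steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `uniform_masking` and `compact_input` are hypothetical, the patch grid is assumed to have even height and width, and a fixed vector stands in for the learnable mask token.

```python
import numpy as np


def uniform_masking(grid_h, grid_w, secondary_ratio=0.25, rng=None):
    """Sketch of Uniform Sampling followed by Secondary Masking.

    Returns (rows, cols, sampled, secondary):
      rows, cols : (grid_h/2, grid_w/2) indices of the patch kept in each 2x2 cell
      sampled    : (grid_h, grid_w) boolean map of patches kept by US
      secondary  : (grid_h/2, grid_w/2) boolean map of kept patches masked by SM
    """
    if grid_h % 2 or grid_w % 2:
        raise ValueError("patch grid must have even height and width")
    rng = np.random.default_rng() if rng is None else rng
    cells_h, cells_w = grid_h // 2, grid_w // 2

    # Uniform Sampling: choose one of the four positions inside every 2x2 cell.
    choice = rng.integers(0, 4, size=(cells_h, cells_w))
    rows = 2 * np.arange(cells_h)[:, None] + choice // 2
    cols = 2 * np.arange(cells_w)[None, :] + choice % 2
    sampled = np.zeros((grid_h, grid_w), dtype=bool)
    sampled[rows, cols] = True

    # Secondary Masking: mask a random fraction of the already sampled patches.
    n_sampled = cells_h * cells_w
    n_secondary = int(round(secondary_ratio * n_sampled))
    picked = rng.choice(n_sampled, size=n_secondary, replace=False)
    secondary = np.zeros(n_sampled, dtype=bool)
    secondary[picked] = True
    return rows, cols, sampled, secondary.reshape(cells_h, cells_w)


def compact_input(patches, rows, cols, secondary, mask_token):
    """Gather the sampled patches into a compact (H/2, W/2, D) grid.

    Positions selected by Secondary Masking are overwritten with mask_token,
    which here is a plain array standing in for a learnable parameter.
    """
    compact = patches[rows, cols]  # fancy indexing returns a new array
    compact[secondary] = mask_token
    return compact


if __name__ == "__main__":
    # Assumed example: a 14x14 patch grid with 8-dimensional patch embeddings.
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(14, 14, 8))
    rows, cols, sampled, secondary = uniform_masking(14, 14, 0.25, rng)
    compact = compact_input(patches, rows, cols, secondary, np.zeros(8))
    print(sampled.sum(), secondary.sum(), compact.shape)
```

For a 14×14 patch grid, this keeps 49 patches (one per 2×2 cell) and replaces 12 of them with the mask token, so the encoder sees a dense 7×7 grid. Because every 2×2 cell contributes exactly one patch, the compact grid stays regular, which is what lets window-based operators run on it unchanged.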

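Experimental Results

Research Questions

RQ1: Can MAE-style pre-training be applied effectively to Pyramid-based ViTs with local windows without excessive computation?
RQ2: Which sampling and masking strategy best preserves and strengthens transferable representations in pyramid architectures?
RQ3: How does UM-MAE compare with existing MIM methods in pre-training efficiency and downstream accuracy?
RQ4: Does intermediate fine-tuning affect how well UM-MAE transfers to dense prediction tasks?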

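Key Findings

Compared with SimMIM, UM-MAE speeds up pre-training for Pyramid-based ViTs by about 2× and reduces GPU memory by at least 2×.
Uniform Sampling combined with a 25% Secondary Masking ratio yields a strong trade-off, matching or exceeding the MAE baseline on downstream tasks.
With Swin-T, UM-MAE reaches 82.04 Top-1 on IN1K, 45.96 mIoU on ADE20K, and 47.7 AP on COCO across settings, while improving memory and time relative to SimMIM.
For large models (Swin-L), UM-MAE pre-trained on IN1K can outperform the supervised IN22K baseline while keeping the number of pre-training epochs low.
For Pyramid-based ViTs under MIM, intermediate fine-tuning on IN1K is critical for good downstream performance and often yields gains over direct fine-tuning.
Compared with strong MIM baselines, UM-MAE maintains competitive or improved downstream performance while reducing pre-training resources.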


This review was generated by AI and checked by a human editor.