QUICK REVIEW

[Paper Review] Masked Autoencoders Are Scalable Vision Learners

Kaiming He, Xinlei Chen | arXiv (Cornell University) | Nov 11, 2021

Domain Adaptation and Few-Shot Learning | References: 59 | Citations: 190

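One-Line Summary

This paper proposes Masked Autoencoders (MAE), a method for pre-training vision transformers by masking most of an image's patches and reconstructing the missing pixels. Using an asymmetric encoder-decoder design, it achieves scalable self-supervised learning, outperforms supervised pre-training on ImageNet-1K, and improves transfer performance on downstream tasks.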

ABSTRACT

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

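Research Motivation and Objectives

Motivate self-supervised pre-training for vision that scales to large models without relying on labeled data.
Develop an asymmetric MAE architecture in which the encoder processes only the visible patches and a lightweight decoder reconstructs the full image.
Show that a high masking ratio (around 75%) creates a meaningful self-supervisory task and enables faster, more memory-efficient pre-training.
Show that MAE pre-training improves transfer performance on detection, segmentation, and classification tasks compared with supervised pre-training.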

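Proposed Method

Divide the input image into non-overlapping patches, randomly sample a small subset to keep visible, and mask the rest, which is the large majority (e.g., 75%).
Form the latent representation with an encoder that processes only the visible patches, with no mask tokens.
Append mask tokens to the encoded visible patches and pass the full set to a lightweight decoder that reconstructs the original image at the pixel level.
Compute the reconstruction loss (mean squared error) on the masked patches only, optionally using per-patch normalized pixel values as the target.
Add positional embeddings to all tokens. The decoder is designed independently of the encoder and kept small, which reduces computation.
After pre-training, fine-tune the encoder on full (unmasked) images for recognition tasks and compare against supervised baselines.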
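
The masking and loss steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`random_masking`, `normalize_patch`, `masked_mse`), the use of Python lists instead of tensors, the `eps` value, and the population-variance normalization are all choices made here for clarity.

```python
import random


def random_masking(num_patches, mask_ratio, rng):
    """Split patch indices into (visible, masked) by uniform random sampling.

    Assumption: the number of kept patches is floor(num_patches * (1 - mask_ratio)).
    """
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


def normalize_patch(pixels, eps=1e-6):
    """Normalize one patch by its own mean and (population) variance."""
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [(p - mean) / (var + eps) ** 0.5 for p in pixels]


def masked_mse(pred, target, masked_ids, norm_pix=False):
    """Mean squared error averaged over masked patches only.

    pred and target are lists of patches; each patch is a flat list of pixels.
    Visible patches never contribute to the loss.
    """
    total = 0.0
    for i in masked_ids:
        t = normalize_patch(target[i]) if norm_pix else target[i]
        total += sum((a - b) ** 2 for a, b in zip(pred[i], t)) / len(t)
    return total / len(masked_ids)


# Usage: 196 patches with a 75% masking ratio.
rng = random.Random(0)
visible, masked = random_masking(196, 0.75, rng)
```

In this sketch only the patches listed in `visible` would be fed to the encoder, while `masked_mse` scores the decoder output on the patches listed in `masked`. Passing `norm_pix=True` switches the target to per-patch normalized pixels.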

Figure 1: Our MAE architecture. During pre-training, a large random subset of image patches (e.g., 75%) is masked out. The encoder is applied to the small subset of visible patches. Mask tokens are introduced after the encoder, and the full set of encoded patches and mask tokens is processed b

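Experimental Results

Research Questions

RQ1: Can masked autoencoding with a high masking ratio provide scalable self-supervised visual representations?
RQ2: Can an asymmetric encoder-decoder design reduce computation while maintaining or improving representation quality?
RQ3: How does MAE pre-training scale with model size and in transfer to downstream vision tasks, compared with supervised pre-training?
RQ4: Which reconstruction target (pixels vs. tokens) and masking strategy yield the best transfer performance?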

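Key Findings

MAE with a high masking ratio (around 75%) learns strong self-supervised representations, enabling large ViT models that outperform supervised pre-training on ImageNet-1K after fine-tuning.
The asymmetric design, with an encoder that processes only visible patches and a small decoder responsible for reconstruction, substantially reduces training FLOPs and memory and delivers a speedup of 3x or more.
Decoder depth and width matter more for linear probing than for fine-tuning: a deeper decoder helps linear probing, while a very small decoder is sufficient for fine-tuning.
Pixel-based reconstruction (with normalization) matches or outperforms token-based targets on transfer tasks, so tokenization is not required for strong performance.
MAE transfers robustly to object detection, instance segmentation, and semantic segmentation, and its gains over supervised pre-training often grow with model size.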
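
To make the efficiency finding concrete, the sketch below counts how many tokens the encoder sees at a given masking ratio. It is a back-of-the-envelope illustration only: the 224-pixel image size and 16-pixel patch size are assumptions (chosen because they give the 196 patches mentioned in the Figure 2 caption), the helper names are hypothetical, and the cost model (linear layers scale with token count, self-attention with its square) is a rough approximation that ignores the decoder and does not reproduce the reported speedup.

```python
def encoder_tokens(image_size, patch_size, mask_ratio):
    """Return (total_patches, visible_patches) for a square image.

    Assumption: visible count is floor(total * (1 - mask_ratio)).
    """
    per_side = image_size // patch_size
    total = per_side * per_side
    visible = int(total * (1 - mask_ratio))
    return total, visible


def relative_cost(total, visible):
    """Rough per-layer encoder cost relative to processing every patch."""
    frac = visible / total
    return {
        "linear": frac,          # MLP and projection layers: linear in tokens
        "attention": frac ** 2,  # attention score matrix: quadratic in tokens
    }


total, visible = encoder_tokens(224, 16, 0.75)  # (196, 49)
cost = relative_cost(total, visible)            # linear 0.25, attention 0.0625
```

With a 75% masking ratio the encoder handles 49 of 196 tokens, and with the 80% ratio used in Figure 2 the same formula gives 39 of 196, which agrees with the caption.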

Figure 2: Example results on ImageNet validation images. For each triplet, we show the masked image (left), our MAE reconstruction† (middle), and the ground-truth (right). The masking ratio is 80%, leaving only 39 out of 196 patches. More examples are in the appendix. † As no loss is computed on v

This review was generated by AI and checked by a human editor.