QUICK REVIEW

[Paper Review] EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Yuxin Fang, Wen Wang|arXiv (Cornell University)|Nov 14, 2022

Multimodal Machine Learning Applications | Cited by 23

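One-sentence summary

EVA scales a vanilla ViT to 1B parameters and pre-trains it with masked image modeling to reconstruct image-text aligned CLIP features, reaching state-of-the-art results on vision tasks using only publicly accessible data and serving as a bridge to multimodal understanding.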

ABSTRACT

We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on LVISv1.0 dataset with over a thousand categories and COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.

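Motivation and objectives

Explore a scalable, vision-centric foundation model trained on publicly accessible data.
Identify a pretext task that lets a one-billion-parameter vision model transfer broadly across vision tasks.
Demonstrate EVA's transfer to image classification, object detection, semantic segmentation, and video action recognition.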

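Proposed method

Use a vanilla ViT architecture with 1.0B parameters, without relative positional embeddings or layer scale during pre-training.
Pre-train the model to reconstruct masked CLIP features conditioned on the visible patches (MIM with 40% masking).
Train on 29.6 million publicly accessible images from CC12M, CC3M, COCO, ADE20K, ImageNet-21K, and Object365; the reconstruction targets are features from a CLIP model trained on an image-text dataset of 400 million pairs.
Adopt DeepSpeed ZeRO-1, fp16, and cosine learning-rate decay for scalable pre-training on NVIDIA A100 GPUs.
Evaluate on a broad suite: ImageNet-1K classification, video action recognition (Kinetics), detection and segmentation on COCO and LVIS, semantic segmentation on ADE20K and COCO-Stuff, and zero-shot image and video transfer.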
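
The pretext task above can be illustrated with a minimal sketch. It is not the authors' implementation: the review does not state the loss function, so negative cosine similarity between predicted and target features is an assumption here, and uniform random masking stands in for whatever masking strategy the paper actually uses. The function names `sample_mask` and `masked_feature_loss` are hypothetical, and features are plain Python lists rather than real ViT or CLIP outputs.

```python
import math
import random


def sample_mask(num_patches, mask_ratio=0.4, seed=0):
    """Return a list of booleans; True marks a masked patch.

    Assumption: patches are masked uniformly at random. Only the 40% ratio
    comes from the review; the sampling strategy is a simplification.
    """
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def masked_feature_loss(predicted, target, mask):
    """Mean negative cosine similarity over masked positions only.

    predicted: per-patch features produced by the model being pre-trained.
    target:    per-patch features from the frozen teacher (CLIP in the review).
    Assumption: the loss form is illustrative, not confirmed by the review.
    """
    sims = [
        cosine_similarity(p, t)
        for p, t, m in zip(predicted, target, mask)
        if m
    ]
    if not sims:
        raise ValueError("mask selects no patches")
    return -sum(sims) / len(sims)
```

With this definition the loss approaches -1 when the predicted features point in the same direction as the targets at every masked position, and visible patches contribute nothing to it.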

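Experimental results

Research questions

RQ1: Can a one-billion-parameter vision transformer, trained on public data with a simple masked image modeling objective, achieve state-of-the-art transfer across major vision tasks?
RQ2: Does regressing masked CLIP vision features provide a scalable and effective pretext signal for large-scale visual representation learning?
RQ3: What transfer and robustness properties emerge as EVA's model size and data scale up?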

Key findings

Model	#Params	Extra labeled data	Image size	Top-1 accuracy (%)
ConvNeXt	?	IN-21K-ext-70M	640^2	87.5
SwinV2	?	IN-21K-ext-70M	640^2	87.5
MAE	?	IN-21K (14M)	?	87.8
DeiT3	?	IN-21K (14M)	?	87.7
Eff-L2-NS	?	IN-21K (14M)	?	88.4
BEiTv2	?	IN-21K (14M)	?	88.4
BEiT	?	IN-21K (14M)	?	88.6
EVA	1.0B	IN-21K (14M)	336^2	89.6

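EVA achieves state-of-the-art results on ImageNet-1K using public data and a linear classifier head: 89.6% top-1 accuracy at 336^2 input and 89.7% at 560^2 input.
It achieves strong instance segmentation and object detection performance on LVISv1.0 and COCO, narrowing the gap between COCO and LVIS, and reaches 64.7 AP^box / 55.0 AP^mask on COCO test-dev with single-scale testing.
EVA reaches 62.3 mIoU on ADE20K and 53.4 mIoU on COCO-Stuff, showing strong transfer to semantic segmentation.
On video, EVA reaches 89.7% top-1 accuracy on Kinetics-400, 89.8% on Kinetics-600, and 82.9% on Kinetics-700, showing that the image-pre-trained model transfers well to video action recognition.
Initializing a giant CLIP from EVA improves zero-shot image and video classification, stabilizes training, and enables efficient scaling of multimodal foundation models.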
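
For readers unfamiliar with the zero-shot classification mentioned in the last finding, the sketch below shows the general CLIP-style recipe: embed the image and one text prompt per class, then pick the class whose text embedding has the highest cosine similarity with the image embedding. The helper names `l2_normalize` and `zero_shot_classify` are hypothetical, and the embeddings are toy vectors supplied by the caller; no real image or text encoder is invoked, and nothing here is taken from the EVA codebase.

```python
import math


def l2_normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return (best_label, scores) for one image.

    image_embedding:       output of an image encoder (assumed given).
    class_text_embeddings: dict mapping class label -> text embedding of a
                           prompt describing that class (assumed given).
    Scores are cosine similarities between the normalized embeddings.
    """
    img = l2_normalize(image_embedding)
    scores = {}
    for label, text_emb in class_text_embeddings.items():
        txt = l2_normalize(text_emb)
        scores[label] = sum(a * b for a, b in zip(img, txt))
    best_label = max(scores, key=scores.get)
    return best_label, scores
```

Because no classifier is trained on the target dataset, the quality of the prediction depends entirely on how well the image and text embeddings are aligned, which is why the choice of vision tower initialization matters for this setting.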

This review was generated by AI and checked by a human editor.