QUICK REVIEW

[Paper Review] BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Dong Li|arXiv (Cornell University)|Jun 15, 2021

Multimodal Machine Learning Applications | References: 52 | Citations: 922

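One-Sentence Summary

BEiT pre-trains vision Transformers with masked image modeling, using a tokenizer that maps images to discrete visual tokens, and this improves fine-tuned performance on ImageNet and ADE20K.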

ABSTRACT

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.

Motivation and Objectives

Reduce the amount of labeled data needed to train vision Transformers by using self-supervised pre-training.
Introduce a BERT-like masked image modeling objective for images.
Use a discrete visual tokenizer to predict tokens rather than pixel values.
Show that BEiT pre-training speeds up fine-tuning and improves convergence.
Demonstrate that BEiT learns semantic regions without labels.
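
The masked image modeling objective named in the list above can be written compactly. The notation here is an assumption chosen for illustration: \mathcal{D} is the unlabeled training set, \mathcal{M} the set of masked patch positions, x^{\mathcal{M}} the image with those patches masked, and z_i the discrete visual token that the tokenizer assigns to patch i.

```latex
\max \sum_{x \in \mathcal{D}} \mathbb{E}_{\mathcal{M}}
  \left[ \sum_{i \in \mathcal{M}} \log p_{\mathrm{MIM}}\!\left(z_i \mid x^{\mathcal{M}}\right) \right]
```

In words: for each masked position, the model is trained to assign high probability to the correct visual token given only the corrupted image, which is a classification problem over the token vocabulary rather than a regression on pixel values.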

Proposed Method

Tokenize images into discrete visual tokens using a pre-learned image tokenizer (dVAE).
Split each 224x224 image into a 14x14 grid of 16x16-pixel patches and feed them to the Transformer as patch embeddings.
Mask about 40% of patches and predict the corresponding visual tokens via a softmax over the token vocabulary.
Pre-train a ViT-like Transformer with the masked image modeling (MIM) objective, masking contiguous blocks of patches (blockwise masking) rather than independently sampled patches.
Fine-tune the pretrained encoder on downstream tasks by adding task-specific heads (classification, segmentation).
Optionally perform intermediate fine-tuning on a labeled dataset (e.g., ImageNet) before task-specific fine-tuning.
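
The patch grid and masking steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `blockwise_mask`, the 224-pixel input size, the minimum block size of 16 patches, and the aspect-ratio range are assumptions made for the example. It only shows how masking rectangular blocks until roughly 40% of the grid is covered differs from picking patches independently.

```python
import math
import random

IMAGE_SIZE = 224                    # input resolution in pixels (assumed)
PATCH_SIZE = 16                     # patch side in pixels, per the abstract
GRID = IMAGE_SIZE // PATCH_SIZE     # 14 patches per side
NUM_PATCHES = GRID * GRID           # 196 patches per image


def blockwise_mask(grid=GRID, mask_ratio=0.4, min_block=16,
                   min_aspect=0.3, rng=None):
    """Return a set of (row, col) patch positions chosen in rectangular blocks.

    Blocks are added until at least mask_ratio of the grid is covered, so the
    final ratio can slightly overshoot the target.
    """
    if rng is None:
        rng = random.Random()
    target = int(mask_ratio * grid * grid)
    masked = set()
    while len(masked) < target:
        remaining = target - len(masked)
        # Sample a block area (in patches) and an aspect ratio (assumed ranges).
        area = rng.uniform(min_block, max(min_block, remaining))
        log_aspect = rng.uniform(math.log(min_aspect), -math.log(min_aspect))
        aspect = math.exp(log_aspect)
        # Convert area and aspect ratio into a block height and width.
        height = min(grid, max(1, round(math.sqrt(area * aspect))))
        width = min(grid, max(1, round(math.sqrt(area / aspect))))
        # Place the block at a random position that fits inside the grid.
        top = rng.randint(0, grid - height)
        left = rng.randint(0, grid - width)
        for row in range(top, top + height):
            for col in range(left, left + width):
                masked.add((row, col))
    return masked


if __name__ == "__main__":
    mask = blockwise_mask(rng=random.Random(0))
    print(f"masked {len(mask)} of {NUM_PATCHES} patches")
    # Text rendering of the grid: '#' is a masked patch, '.' a visible one.
    for row in range(GRID):
        print("".join("#" if (row, col) in mask else "." for col in range(GRID)))
```

Running the script prints the 14x14 grid with masked patches marked, which makes the contiguous structure of the mask easy to see. In pre-training, the model would predict the visual token only at the masked positions.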

Experimental Results

Research Questions

RQ1: Can a BERT-style masked image modeling objective enable effective self-supervised pre-training for vision Transformers?
RQ2: Do discrete visual tokens (from a dVAE) provide a better pre-training bottleneck than pixel-level reconstruction?
RQ3: Does blockwise masking improve pre-training efficacy for downstream vision tasks?
RQ4: Is BEiT complementary to supervised pre-training and beneficial with intermediate fine-tuning?
RQ5: What representations (e.g., attention maps) emerge after BEiT pre-training for semantic regions?

Key Findings

BEiT outperforms training from scratch and several prior self-supervised methods on ImageNet after fine-tuning.
BEiT gains more from scaling to the large model than supervised pre-training on ImageNet-22K does; BEiT-384-L surpasses the base-size BEiT-384 by about 1.7 percentage points.
BEiT-B reaches 83.2% top-1 accuracy on ImageNet-1K, and BEiT-384-L reaches 86.3% (Table 1).
Intermediate fine-tuning on ImageNet provides additional gains for BEiT on ImageNet and downstream tasks.
On ADE20K semantic segmentation, BEiT reaches 45.6 mIoU, and 47.7 with intermediate fine-tuning (Table 3).
Ablations show blockwise masking and predicting visual tokens are critical; pixel-level recovery performs worse than token-based prediction.

This review was generated by AI and checked by a human editor.