QUICK REVIEW

[Paper Review] BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Dong Li | arXiv (Cornell University) | Jun 15, 2021

Multimodal Machine Learning Applications | References: 52 | Cited by: 922

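One-Sentence Summary

BEiT proposes masked image modeling with discrete visual tokens as a pre-training task for vision Transformers, and achieves strong fine-tuning performance on ImageNet and ADE20K.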

ABSTRACT

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.
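
The objective stated in the abstract (recover the original visual tokens from the corrupted image patches) can be written compactly as a masked log-likelihood. The notation below is chosen here for illustration: \mathcal{D} is the unlabeled image set, \mathcal{M} the randomly sampled set of masked patch positions, x^{\mathcal{M}} the corrupted image, and z_i the visual token at position i.

```latex
\max \sum_{x \in \mathcal{D}} \mathbb{E}_{\mathcal{M}}
  \left[ \sum_{i \in \mathcal{M}} \log p_{\mathrm{MIM}}\left( z_i \mid x^{\mathcal{M}} \right) \right]
```

Only masked positions contribute to the sum, so the model is never rewarded for copying tokens of patches it can see.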

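Motivation and Goals

Reduce the amount of labeled training data that vision Transformers need by using self-supervised pre-training.
Introduce a BERT-like masked image modeling objective for images.
Predict discrete visual tokens rather than raw pixel values.
Show that BEiT pre-training speeds up fine-tuning and improves convergence.
Show that BEiT learns to distinguish semantic regions without any human annotation.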

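Proposed Method

Tokenize each image into discrete visual tokens with a pre-trained image tokenizer (dVAE).
Split each image into a 14x14 grid of patches (16x16 pixels each for a 224x224 input) and feed the patch embeddings to the Transformer.
Mask about 40% of the patches and predict the corresponding visual tokens with a softmax over the token vocabulary.
Pre-train a ViT-style Transformer with the MIM objective, using blockwise masking so that masked regions are spatially contiguous rather than scattered.
Fine-tune the pre-trained encoder on downstream tasks by adding task-specific heads (classification, segmentation).
Optionally run intermediate fine-tuning on a labeled dataset (e.g., ImageNet) before the task-specific fine-tuning.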
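
The masking and prediction steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the minimum block size of 16 patches, the aspect-ratio range, and the log-uniform sampling of the aspect ratio are all assumptions made for this sketch. It samples rectangular blocks on the 14x14 patch grid until roughly 40% of the patches are masked, then averages the cross-entropy over masked positions only.

```python
import math
import random


def blockwise_mask(grid_h=14, grid_w=14, mask_ratio=0.4, min_block=16,
                   min_aspect=0.3, rng=None):
    """Return a set of masked (row, col) patch positions.

    Rectangular blocks are added until at least mask_ratio of the grid is
    covered. min_block and min_aspect are assumed values for this sketch.
    """
    rng = rng or random.Random()
    target = int(mask_ratio * grid_h * grid_w)  # 78 of 196 patches by default
    masked = set()
    while len(masked) < target:
        remaining = target - len(masked)
        # Sample a block area (in patches) and an aspect ratio.
        area = rng.uniform(min_block, max(min_block, remaining))
        aspect = math.exp(rng.uniform(math.log(min_aspect),
                                      math.log(1.0 / min_aspect)))
        # Convert area and aspect ratio to a block size, clipped to the grid.
        h = min(grid_h, max(1, round(math.sqrt(area * aspect))))
        w = min(grid_w, max(1, round(math.sqrt(area / aspect))))
        top = rng.randint(0, grid_h - h)
        left = rng.randint(0, grid_w - w)
        for i in range(top, top + h):
            for j in range(left, left + w):
                masked.add((i, j))
    return masked


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def masked_token_loss(logits, token_ids, masked, grid_w=14):
    """Mean cross-entropy over masked positions only.

    logits: one list of vocabulary scores per patch, in row-major order.
    token_ids: the tokenizer's visual-token id for each patch, same order.
    """
    total = 0.0
    for i, j in masked:
        idx = i * grid_w + j
        total -= log_softmax(logits[idx])[token_ids[idx]]
    return total / len(masked)


if __name__ == "__main__":
    mask = blockwise_mask(rng=random.Random(0))
    # Toy setup: 8-token vocabulary and uniform (all-zero) logits.
    logits = [[0.0] * 8 for _ in range(14 * 14)]
    tokens = [3] * (14 * 14)
    print(len(mask), masked_token_loss(logits, tokens, mask))
```

Because the loop stops only once the target is reached, the final block usually pushes the masked fraction slightly above 40%. With uniform logits the loss equals the log of the vocabulary size, which is a convenient sanity check before plugging in a real encoder and tokenizer.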

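Experimental Results

Research Questions

RQ1: Can a BERT-style masked image modeling objective provide effective self-supervised pre-training for vision Transformers?
RQ2: Do discrete visual tokens from a dVAE provide a better pre-training bottleneck than pixel-level reconstruction?
RQ3: Does blockwise masking improve pre-training for downstream vision tasks?
RQ4: Is BEiT complementary to supervised pre-training, and does it benefit from intermediate fine-tuning?
RQ5: After BEiT pre-training, what do the learned representations (e.g., attention maps) reveal about semantic regions?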

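Main Findings

After fine-tuning, BEiT outperforms both training from scratch and several previous self-supervised methods on ImageNet.
BEiT scales better from base to large size than supervised pre-training on ImageNet-22K; BEiT-384-L improves on BEiT-384 by about 1.7 percentage points.
On ImageNet, BEiT-B reaches 83.2% top-1 accuracy and BEiT-384-L reaches 86.3% (Table 1).
Intermediate fine-tuning on ImageNet gives BEiT additional gains on ImageNet and on downstream tasks.
On ADE20K semantic segmentation, BEiT reaches 45.6 mIoU, and 47.7 mIoU with intermediate fine-tuning (Table 3).
Ablations show that blockwise masking and visual-token prediction are both critical; pixel-level reconstruction performs worse than token-based prediction.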

This review was generated by AI and reviewed by human editors.