QUICK REVIEW

[Paper Review] iBOT: Image BERT Pre-Training with Online Tokenizer

Jinghao Zhou, Wei Chen|arXiv (Cornell University)|Nov 15, 2021

Multimodal Machine Learning Applications | Cited by 209

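One-Sentence Summary

iBOT introduces an online visual tokenizer for masked image modeling and, through self-distillation, achieves state-of-the-art results on ImageNet while performing strongly on dense downstream tasks.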

ABSTRACT

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which helps the models to obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.

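Motivation and Goals

Advance the BERT-style pre-training paradigm for vision by using semantically meaningful visual tokens.
Remove the need for an offline, pre-trained tokenizer by learning the tokenizer online, jointly with the model.
Improve masked image modeling (MIM) through knowledge distillation in which the teacher network serves as the online tokenizer.
Explore how jointly learning token semantics can improve robustness and downstream performance on both classification and dense tasks.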

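Proposed Method

Formulate masked image modeling as knowledge distillation between a teacher (the online tokenizer) and a student (a Vision Transformer).
Use two losses: a cross-view [CLS] self-distillation loss to acquire visual semantics, and an MIM loss that reconstructs masked patch tokens using the teacher's outputs as targets.
Share the projection head between the [CLS] token and the patch tokens so that semantic information propagates to the patch tokens.
Implement the online tokenizer as a teacher that is updated alongside the student via momentum (an exponential moving average of the student weights), removing the need to pre-train an offline tokenizer.
Perform self-distillation on cross-view [CLS] tokens to bootstrap meaningful visual semantics, and supervise with softmax token distributions rather than hard one-hot token ids.
Evaluate with ViT and Swin backbones pre-trained on ImageNet-1K and ImageNet-22K, followed by linear probing, k-NN, and fine-tuning.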
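
The two losses and the momentum update described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: the function names, the temperature values (0.04 for the teacher, 0.1 for the student), the momentum of 0.996, and the unweighted sum of the two losses are all assumptions chosen for the example, and logits are plain lists of floats rather than network outputs.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def soft_cross_entropy(teacher_logits, student_logits,
                       t_teacher=0.04, t_student=0.1):
    """Cross-entropy with a soft (softmax) teacher target, not a one-hot id.

    Temperatures are illustrative assumptions, not values from the review.
    """
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, t_teacher)]
    log_p_student = log_softmax(student_logits, t_student)
    return -sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))

def cls_loss(teacher_cls_a, student_cls_b, teacher_cls_b, student_cls_a):
    """Cross-view [CLS] self-distillation: teacher on one view supervises
    the student on the other view, averaged over both directions."""
    return 0.5 * (soft_cross_entropy(teacher_cls_a, student_cls_b)
                  + soft_cross_entropy(teacher_cls_b, student_cls_a))

def mim_loss(teacher_patch_logits, student_patch_logits, mask):
    """Masked image modeling loss: the teacher sees the unmasked view, the
    student sees the masked one, and only masked positions contribute."""
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    if not masked:
        return 0.0
    total = sum(soft_cross_entropy(teacher_patch_logits[i],
                                   student_patch_logits[i]) for i in masked)
    return total / len(masked)

def ema_update(teacher_params, student_params, momentum=0.996):
    """Momentum (EMA) update: the teacher tracks the student, so the
    tokenizer is learned online. The momentum value is an assumption."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

if __name__ == "__main__":
    # Toy example: 3 patches, 4 "token" dimensions, patches 0 and 2 masked.
    teacher_patches = [[2.0, 0.1, 0.0, 0.3], [0.0, 1.5, 0.2, 0.1], [0.1, 0.0, 1.8, 0.2]]
    student_patches = [[1.5, 0.2, 0.1, 0.0], [0.3, 0.9, 0.0, 0.2], [0.0, 0.4, 1.1, 0.1]]
    mask = [True, False, True]
    total = (cls_loss([1.0, 0.2], [0.8, 0.1], [0.9, 0.3], [1.1, 0.0])
             + mim_loss(teacher_patches, student_patches, mask))  # assumed unweighted sum
    print("total loss:", total)
    print("updated teacher:", ema_update([1.0, -1.0], [0.0, 0.0]))
```

Because the target is a full softmax distribution, a student that ranks the teacher's top token correctly but with low confidence is still penalized in proportion to the mismatch, which is the practical difference from supervising with a hard token id.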

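Experimental Results

Research Questions

RQ1: Can masked image modeling with an online, jointly learned visual tokenizer outperform approaches that rely on an offline tokenizer when pre-training Vision Transformers in a self-supervised manner?
RQ2: Does self-distillation on both the [CLS] token and the patch-level MIM signal yield stronger semantic representations and robustness for downstream tasks?
RQ3: How does sharing the projection head between the [CLS] token and the patch tokens affect the learned semantics and performance?
RQ4: How do the semantics of the online tokenizer affect linear probing, fine-tuning, and transfer to dense vision tasks?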

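Main Findings

iBOT achieves state-of-the-art results across multiple ImageNet-1K evaluation protocols, including 82.3% linear probing accuracy and 87.8% fine-tuning accuracy with ViT-L/16 pre-trained on ImageNet-22K.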
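Across ViT-S/16, ViT-B/16, and ViT-L/16 backbones evaluated on standard ImageNet-1K, iBOT surpasses prior SSL methods, with fine-tuning accuracy reaching up to 84.8% and linear evaluation up to 82.3% (the latter with ImageNet-22K pre-training).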
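iBOT reveals part-level semantics in patch tokens, which contribute to improved robustness against corruptions and to better performance on dense tasks such as object detection, instance segmentation, and semantic segmentation.
Compared to DINO, iBOT shows larger gains with bigger models, indicating stronger scalability of the online-tokenizer approach.
On transfer-learning benchmarks covering small datasets (e.g., CIFAR, Flowers, Cars) and larger domain-specific datasets (iNaturalist 18/19), iBOT consistently outperforms the BEiT and DINO baselines, especially with larger backbones.
iBOT shows improved robustness over strong baselines under background changes, occlusion, and out-of-distribution data.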

This review was generated by AI and reviewed by human editors.