QUICK REVIEW

[Paper Review] Benchmarking Detection Transfer Learning with Vision Transformers

Yanghao Li, Saining Xie|arXiv (Cornell University)|Nov 22, 2021

Advanced Neural Network Applications | References: 30 | Cited by: 75

One-Sentence Summary

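The paper benchmarks five ViT initializations (random, supervised ImageNet, MoCo v3, BEiT, MAE) as Mask R-CNN backbones on COCO and shows that masking-based pre-training yields the strongest transfer gains, with those gains growing as model size increases.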

ABSTRACT

Object detection is a central downstream task used to test if pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consumption, unknown training formulae, etc.) have prevented recent studies from benchmarking detection transfer learning with standard ViT models. In this paper, we present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN. These tools facilitate the primary goal of our study: we compare five ViT initializations, including recent state-of-the-art self-supervised learning methods, supervised initialization, and a strong random initialization baseline. Our results show that recent masking-based unsupervised learning methods may, for the first time, provide convincing transfer learning improvements on COCO, increasing box AP up to 4% (absolute) over supervised and prior self-supervised pre-training methods. Moreover, these masking-based initializations scale better, with the improvement growing as model size increases.

Motivation and Goals

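Establish a transfer-learning evaluation protocol for Vision Transformer backbones on detection and instance segmentation, using COCO and Mask R-CNN.
Overcome the practical obstacles that keep standard ViT backbones out of standard detection frameworks.
Systematically compare several initializations (random, supervised, MoCo v3, BEiT, MAE) on detection.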

Proposed Method

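Adapt the ViT backbone to Mask R-CNN: four resolution-modifying modules, applied to features taken at different depths of the ViT, build a multi-scale feature pyramid that is compatible with FPN.
Use windowed self-attention to cut memory and time costs, and insert four global attention blocks to preserve information flow across windows.
Upgrade Mask R-CNN components (batch normalization after convolutions, longer training schedules, and large-scale jitter (LSJ) data augmentation) so that both training from scratch and fine-tuning from pre-trained weights are feasible.
Use one consistent training formula (LSJ, AdamW, warmup, drop path) and a hyperparameter tuning protocol focused on learning rate, weight decay, and drop path rate.
Standardize positional information by handling both absolute and relative position embeddings, so that the comparison across pre-training methods is fair.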
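The first two adaptations can be made concrete with a little arithmetic. The sketch below is illustrative only: the patch size of 16, the scale factors 4, 2, 1 and 1/2, the 14×14 window, the 1024×1024 input, and the helper names `pyramid_plan` and `attention_pairs` are all assumptions made for this example and are not stated in the summary above. It shows which blocks would be tapped for the pyramid and how many query-key pairs a windowed block touches compared with a global one.

```python
# Illustrative sketch only: patch size, scale factors, window size, input size
# and block placement are assumptions for this example.

def pyramid_plan(depth, patch_size=16, scales=(4, 2, 1, 0.5)):
    """Pick equally spaced ViT blocks and the FPN stride each one would feed."""
    if depth % len(scales) != 0:
        raise ValueError("depth must be divisible by the number of pyramid levels")
    step = depth // len(scales)
    plan = []
    for level, scale in enumerate(scales, start=1):
        block_index = level * step        # 1-based index of the tapped block
        stride = int(patch_size / scale)  # base ViT stride rescaled by `scale`
        plan.append({"block": block_index, "scale": scale, "stride": stride})
    return plan


def attention_pairs(tokens_h, tokens_w, window=None):
    """Count query-key pairs in one attention block (a proxy for memory/time)."""
    if window is None:
        # Global attention: every token attends to every token.
        n = tokens_h * tokens_w
        return n * n
    # Windowed attention: pad the token grid up to a multiple of the window,
    # then each token attends only within its own window.
    windows_h = -(-tokens_h // window)    # ceiling division
    windows_w = -(-tokens_w // window)
    per_window = window * window
    return windows_h * windows_w * per_window * per_window


plan = pyramid_plan(depth=12)             # a 12-block ViT (ViT-B sized)
for entry in plan:
    print(entry)

tokens = 1024 // 16                       # assumed 1024x1024 input, 16x16 patches
global_cost = attention_pairs(tokens, tokens)
window_cost = attention_pairs(tokens, tokens, window=14)
print(f"global/windowed pair ratio: {global_cost / window_cost:.1f}")
```

Under these assumptions the tapped blocks land at strides 4, 8, 16 and 32, which matches the usual FPN input strides, and a windowed block touches roughly one seventeenth of the query-key pairs of a global block. That gap is why only a few global blocks are kept.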

Experimental Results

Research Questions

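RQ1: How do different ViT initializations perform as Mask R-CNN backbones on COCO object detection and instance segmentation?
RQ2: Do masking-based pre-training methods (BEiT, MAE) provide transfer-learning gains over supervised pre-training and random initialization, and how do those gains change with model size?
RQ3: Which memory/time trade-offs and architectural choices make ViT backbones competitive in a detection framework?
RQ4: How does the position-encoding scheme affect fine-tuning performance across initialization methods?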

Key Findings

Initialization	Pre-training data	ViT-B APbox	ViT-L APbox	ViT-B APmask	ViT-L APmask
supervised	IN1k w/ labels	47.9	49.3	42.9	43.9
random	none	48.9	50.7	43.6	44.9
MoCo v3	IN1k	47.9	49.3	42.7	44.0
BEiT	IN1k + DALL·E	49.8	53.3	44.4	47.1
MAE	IN1k	50.3	53.3	44.9	47.2

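Mask R-CNN with a ViT backbone trains smoothly under every initialization and needs no gradient clipping.
Training from scratch beats supervised ImageNet pre-training by 1.0 APbox for ViT-B (48.9 vs. 47.9) and by 1.4 APbox for ViT-L (50.7 vs. 49.3).
MoCo v3 falls short of random initialization on APbox and only matches supervised initialization.
BEiT and MAE outperform both random and supervised initialization on APbox. Relative to supervised pre-training, the gain reaches 2.4 APbox for ViT-B and 4.0 APbox for ViT-L, so masking-based methods scale better with model size.
Masking-based pre-training (BEiT, MAE) delivers the first convincing transfer gains on COCO, and the gains grow with model size, unlike supervised pre-training or MoCo v3.
Compared with random initialization, pre-training speeds up convergence on COCO by roughly 4×, and masking-based methods deliver the largest gains as model size grows.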
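The gains quoted in these findings follow directly from the APbox columns of the results table. The short sketch below recomputes them; the helper name `gains_over` is hypothetical and introduced only for this illustration, and the numbers are copied from the table.

```python
# APbox values copied from the results table as (ViT-B, ViT-L).
AP_BOX = {
    "supervised": (47.9, 49.3),
    "random": (48.9, 50.7),
    "MoCo v3": (47.9, 49.3),
    "BEiT": (49.8, 53.3),
    "MAE": (50.3, 53.3),
}


def gains_over(baseline, table=AP_BOX):
    """Return the (ViT-B, ViT-L) APbox difference of every method vs. `baseline`."""
    base_b, base_l = table[baseline]
    return {
        name: (round(ap_b - base_b, 1), round(ap_l - base_l, 1))
        for name, (ap_b, ap_l) in table.items()
        if name != baseline
    }


vs_supervised = gains_over("supervised")
vs_random = gains_over("random")

for name, (gain_b, gain_l) in vs_supervised.items():
    print(f"{name:8s} vs supervised: ViT-B {gain_b:+.1f}, ViT-L {gain_l:+.1f}")
for name, (gain_b, gain_l) in vs_random.items():
    print(f"{name:10s} vs random:     ViT-B {gain_b:+.1f}, ViT-L {gain_l:+.1f}")
```

Reading the output against the table, the 2.4 and 4.0 APbox figures are MAE's margins over supervised pre-training. Its margins over random initialization are smaller, at 1.4 and 2.6, but they also grow from ViT-B to ViT-L.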


This review was generated by AI and reviewed by human editors.