QUICK REVIEW

[Paper Digest] Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality

Xiang Li, Wenhai Wang|arXiv (Cornell University)|May 20, 2022

Advanced Neural Network Applications | Cited by 36

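One-sentence summary

Uniform Masking reconciles MAE-style pre-training with Pyramid ViTs that rely on local windows, making pre-training efficient while keeping fine-tuning performance strong across tasks.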

ABSTRACT

Masked AutoEncoder (MAE) has recently led the trends of visual self-supervision area by an elegant asymmetric encoder-decoder design, which significantly optimizes both the pre-training efficiency and fine-tuning accuracy. Notably, the success of the asymmetric structure relies on the "global" property of Vanilla Vision Transformer (ViT), whose self-attention mechanism reasons over arbitrary subset of discrete image patches. However, it is still unclear how the advanced Pyramid-based ViTs (e.g., PVT, Swin) can be adopted in MAE pre-training as they commonly introduce operators within "local" windows, making it difficult to handle the random sequence of partial vision tokens. In this paper, we propose Uniform Masking (UM), successfully enabling MAE pre-training for Pyramid-based ViTs with locality (termed "UM-MAE" for short). Specifically, UM includes a Uniform Sampling (US) that strictly samples $1$ random patch from each $2 \times 2$ grid, and a Secondary Masking (SM) which randomly masks a portion of (usually $25\%$) the already sampled regions as learnable tokens. US preserves equivalent elements across multiple non-overlapped local windows, resulting in the smooth support for popular Pyramid-based ViTs; whilst SM is designed for better transferable visual representations since US reduces the difficulty of pixel recovery pre-task that hinders the semantic learning. We demonstrate that UM-MAE significantly improves the pre-training efficiency (e.g., it speeds up and reduces the GPU memory by $\sim 2\times$) of Pyramid-based ViTs, but maintains the competitive fine-tuning performance across downstream tasks. For example using HTC++ detector, the pre-trained Swin-Large backbone self-supervised under UM-MAE only in ImageNet-1K can even outperform the one supervised in ImageNet-22K. The codes are available at https://github.com/implus/UM-MAE.

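Motivation and goals

Motivate and enable MAE-style self-supervised pre-training for pyramid-based ViTs that use local windows.
Design Uniform Masking so that the input keeps the same structure across local windows while remaining efficient.
Show that UM-MAE reduces pre-training time and GPU memory consumption while maintaining or improving downstream performance.
Examine how UM-MAE compares with existing MIM methods on downstream tasks: ImageNet-1K classification, ADE20K segmentation, and COCO object detection.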

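Proposed method

Uniform Sampling (US) randomly selects one patch from each 2x2 grid, so 25% of the patches remain visible.
Secondary Masking (SM) randomly masks a portion (e.g., 25%) of the already sampled regions, replacing them with a learnable mask token.
The uniformly sampled patches are reorganized into a compact 2D input and fed to the Pyramid-based ViT encoder.
The decoder remains the lightweight ViT from MAE and reconstructs the original image pixels with a mean squared error loss on the pixels of the missing patches.
The encoder input is reduced to 25% of the tokens; pixel shuffle is used to restore the resolution for the Pyramid backbone.
Training compares UM-MAE against SimMIM and MAE baselines on ImageNet-1K (IN1K), ADE20K, and COCO, and discusses the occasional use of intermediate fine-tuning.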
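
The sampling and masking steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the function names, the 8x8 patch grid, the 16-dimensional patch vectors, and the zero vector standing in for the learnable mask token are all hypothetical. The choice to compute the loss over every patch dropped by Uniform Sampling is also an assumption; the digest only says the loss covers the missing patches.

```python
import numpy as np

def uniform_sampling(grid_h, grid_w, rng):
    """Pick exactly one patch position inside every 2x2 cell of the patch grid."""
    assert grid_h % 2 == 0 and grid_w % 2 == 0
    cells_h, cells_w = grid_h // 2, grid_w // 2
    # One of the 4 positions (0..3) per 2x2 cell, split into row/col offsets.
    choice = rng.integers(0, 4, size=(cells_h, cells_w))
    rows = 2 * np.arange(cells_h)[:, None] + choice // 2
    cols = 2 * np.arange(cells_w)[None, :] + choice % 2
    visible = np.zeros((grid_h, grid_w), dtype=bool)
    visible[rows, cols] = True
    return visible, rows, cols

def secondary_masking(cells_h, cells_w, ratio, rng):
    """Mark a random fraction of the sampled patches for replacement by a mask token."""
    n = cells_h * cells_w
    n_mask = int(round(ratio * n))
    flat = np.zeros(n, dtype=bool)
    flat[rng.choice(n, size=n_mask, replace=False)] = True
    return flat.reshape(cells_h, cells_w)

def masked_mse(pred, target, loss_mask):
    """Mean squared error averaged over the patches where loss_mask is True."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return per_patch[loss_mask].mean()

rng = np.random.default_rng(0)
grid_h = grid_w = 8          # hypothetical patch grid
dim = 16                     # hypothetical flattened patch size
patches = rng.normal(size=(grid_h, grid_w, dim))

visible, rows, cols = uniform_sampling(grid_h, grid_w, rng)

# Compact 2D input: the sampled patches form a dense (grid_h/2, grid_w/2) grid,
# so window-based operators see a regular layout.
compact = patches[rows, cols].copy()

sm = secondary_masking(grid_h // 2, grid_w // 2, 0.25, rng)
mask_token = np.zeros(dim)   # stand-in for a learnable vector (assumption)
compact[sm] = mask_token

# Assumed loss region: every patch not kept by Uniform Sampling.
loss_mask = ~visible
```

Because every 2x2 cell contributes exactly one patch, the compact grid has half the height and half the width of the original patch grid, which is what lets a window-based encoder run on it without handling ragged token sets.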

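Experimental results

Research questions

RQ1: Can MAE-style self-supervised pre-training be applied effectively to pyramid-based ViTs with local windows without excessive computation?
RQ2: Which sampling and masking strategies best preserve or improve transferable representations for pyramid architectures?
RQ3: How does UM-MAE compare with existing MIM methods in pre-training efficiency and downstream accuracy?
RQ4: Does intermediate fine-tuning affect the transfer gains of UM-MAE on dense prediction tasks?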

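Key findings

Compared with SimMIM, UM-MAE greatly speeds up pre-training (about 2×) and reduces GPU memory use (≥2×) for pyramid-based ViTs.
Uniform Sampling combined with a 25% Secondary Masking ratio gives a strong trade-off, matching or exceeding the MAE baseline on downstream tasks.
For Swin-T, UM-MAE reaches 82.04 top-1 on IN1K, 45.96 mIoU on ADE20K, and 47.7 AP on COCO across the reported settings, while improving memory and time relative to SimMIM.
For the large model (Swin-L), UM-MAE pre-trained on IN1K surpasses the supervised IN22K baseline with fewer pre-training epochs.
Intermediate fine-tuning on IN1K is critical to good downstream performance for pyramid-based ViTs under MIM, and usually yields larger gains than direct fine-tuning.
Relative to strong MIM baselines, UM-MAE lowers pre-training resource cost while keeping downstream performance competitive or better.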

This digest was generated by AI and reviewed by a human editor.