QUICK REVIEW

[Paper Review] SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders

Gang Li, Heliang Zheng|arXiv (Cornell University)|Jun 21, 2022

Generative Adversarial Networks and Image Synthesis|Cited by 49

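ONE-SENTENCE SUMMARY

SemMAE adds a self-supervised semantic part learning stage to obtain semantic parts, then trains a masked autoencoder with a semantic-guided masking strategy, improving image representations and reaching state-of-the-art results on several vision tasks.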

ABSTRACT

Recently, significant progress has been made in masked image modeling to catch up to masked language modeling. However, unlike words in NLP, the lack of semantic decomposition of images still makes masked autoencoding (MAE) different between vision and language. In this paper, we explore a potential visual analogue of words, i.e., semantic parts, and we integrate semantic information into the training process of MAE by proposing a Semantic-Guided Masking strategy. Compared to widely adopted random masking, our masking strategy can gradually guide the network to learn various information, i.e., from intra-part patterns to inter-part relations. In particular, we achieve this in two steps. 1) Semantic part learning: we design a self-supervised part learning method to obtain semantic parts by leveraging and refining the multi-head attention of a ViT-based encoder. 2) Semantic-guided MAE (SemMAE) training: we design a masking strategy that varies from masking a portion of patches in each part to masking a portion of (whole) parts in an image. Extensive experiments on various vision tasks show that SemMAE can learn better image representation by integrating semantic information. In particular, SemMAE achieves 84.5% fine-tuning accuracy on ImageNet-1k, which outperforms the vanilla MAE by 1.4%. In the semantic segmentation and fine-grained recognition tasks, SemMAE also brings significant improvements and yields the state-of-the-art performance.

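Research Motivation and Objectives

Bridge the gap between masked image modeling and masked language modeling by identifying a visual analogue of words: semantic parts.
Develop a self-supervised semantic part learning method that yields meaningful part maps on multi-class datasets such as ImageNet.
Propose a semantic-guided masking strategy that moves MAE training gradually from intra-part patterns to inter-part relations.
Show that incorporating semantic information improves the learned representations on classification, segmentation, and fine-grained recognition tasks.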

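Proposed Method

A two-stage framework: semantic part learning followed by semantic-guided masking.
Semantic parts are obtained by embedding the ViT class token into N part tokens and computing patch-part correlations; refined attention maps are then produced through blurring and AdaIN-based texture transfer in a StyleGAN-based decoder.
Taking the argmax over the attention maps gives a part segmentation, which guides the MAE masking: from masking patches within each part, gradually to masking whole parts.
A reconstruction objective, built on the StyleGAN-based decoder and combined with an attention diversity loss, is used to learn spatial part structure (L_rec, L_div, and the total loss L).
An adaptive masking strategy balances intra-part and inter-part masking over the training iterations through an interpolation parameter alpha.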
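
The argmax part assignment and the intra-part to inter-part masking described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names `assign_parts` and `semantic_guided_mask` are hypothetical, and the way alpha blends the two masking levels (mask a share of whole parts that grows with alpha, then spread the rest of the 75% budget evenly over the remaining parts) is an assumption of this sketch. The summary only states that alpha interpolates between the two levels.

```python
import random


def assign_parts(attention_maps):
    """Assign each patch to the part with the highest attention (argmax).

    attention_maps[k][i] is the attention of part k on patch i.
    """
    num_parts = len(attention_maps)
    num_patches = len(attention_maps[0])
    return [max(range(num_parts), key=lambda k: attention_maps[k][i])
            for i in range(num_patches)]


def semantic_guided_mask(part_ids, alpha, mask_ratio=0.75, rng=None):
    """Return the sorted indices of masked patches.

    alpha = 0: mask `mask_ratio` of the patches inside every part (intra-part).
    alpha = 1: mask `mask_ratio` of the parts as whole parts (inter-part).
    The blend for 0 < alpha < 1 is an assumption of this sketch.
    """
    rng = rng or random.Random()

    # Group patch indices by the part they belong to.
    parts = {}
    for idx, part in enumerate(part_ids):
        parts.setdefault(part, []).append(idx)
    target = round(mask_ratio * len(part_ids))

    # Step 1: mask a share of whole parts that grows with alpha.
    part_keys = sorted(parts)
    rng.shuffle(part_keys)
    n_whole = int(alpha * mask_ratio * len(part_keys))
    masked = set()
    for part in part_keys[:n_whole]:
        masked.update(parts[part])

    # Step 2: spend the remaining budget evenly inside the other parts.
    # If the whole parts already exceed the target, nothing more is masked.
    remaining = part_keys[n_whole:]
    n_left = sum(len(parts[part]) for part in remaining)
    budget = max(target - len(masked), 0)
    frac = budget / n_left if n_left else 0.0
    for part in remaining:
        k = min(len(parts[part]), round(frac * len(parts[part])))
        masked.update(rng.sample(parts[part], k))
    return sorted(masked)
```

With four equally sized parts, alpha=0 masks three of the four patches in every part, while alpha=1 masks three parts entirely and leaves the fourth visible. Because parts usually differ in size and counts are rounded, the overall mask ratio in this sketch is only approximately 75%.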

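Experimental Results

Research Questions

RQ1: Can semantic parts learned in a self-supervised way provide meaningful guidance for MAE training?
RQ2: Does masking based on semantic parts (intra-part and inter-part guidance) improve MAE representations over random masking?
RQ3: How do patch size and masking strategy affect SemMAE on linear probing, fine-tuning, and downstream tasks?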

Key Findings

SemMAE reaches 84.5% top-1 fine-tuning accuracy on ImageNet-1k, outperforming vanilla MAE by 1.4%.
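Using 8x8 patches for semantic part learning gives better part segmentation and improves linear-probing accuracy by 1.3–1.9 percentage points over the baseline.
The adaptive masking strategy (from masking 75% of the patches in each part to masking 75% of the parts, gamma=2) gives the best linear-probing result (68.7%).
On ADE20K semantic segmentation, SemMAE reaches 46.3 mIoU, above MAE (46.1) and supervised pre-training (45.3).
In fine-grained transfer, SemMAE outperforms MAE on iNaturalist (82.1 vs. 81.8), CUB birds (87.1 vs. 86.5), Stanford Cars (94.4 vs. 94.2), and others.
The comparison tables show SemMAE reaching state-of-the-art or competitive performance on linear probing, fine-tuning, and downstream tasks.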
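
To make the role of gamma in the adaptive masking result concrete, here is a small sketch of one possible schedule for the interpolation parameter alpha. The power-law form (training progress raised to the power gamma) is an assumption of this sketch; the summary reports only the value gamma=2, not the functional form, and the name `alpha_schedule` is hypothetical.

```python
def alpha_schedule(step, total_steps, gamma=2.0):
    """Map training progress to alpha in [0, 1] (assumed power-law form).

    alpha = 0 corresponds to intra-part masking, alpha = 1 to inter-part
    masking. With gamma > 1, alpha grows slowly at first, so early training
    spends more time on intra-part masking.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return progress ** gamma
```

Under this assumed form with gamma=2, alpha is 0.25 halfway through training, so whole-part masking only dominates in the later iterations.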

This review was generated by AI and reviewed by a human editor.