QUICK REVIEW

[Paper Review] Masked Frequency Modeling for Self-Supervised Visual Pre-Training

Jiahao Xie, Wei Li|arXiv (Cornell University)|Jun 15, 2022

Image Processing Techniques and Applications | Cited by 28

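One-Sentence Summary

MFM masks frequency components in the Fourier domain and predicts the missing frequencies to learn visual representations for both ViT and CNN without mask tokens, achieving performance and robustness competitive with prior MIM methods.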

ABSTRACT

We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models. Instead of randomly inserting mask tokens to the input embeddings in the spatial domain, in this paper, we shift the perspective to the frequency domain. Specifically, MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that predicting masked components in the frequency domain is more ideal to reveal underlying image patterns rather than predicting masked patches in the spatial domain, due to the heavy spatial redundancy. Our findings suggest that with the right configuration of mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among low-frequency counterparts are useful in learning good representations. For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token. Experimental results on image classification and semantic segmentation, as well as several robustness benchmarks show the competitive performance and advanced robustness of MFM compared with recent masked image modeling approaches. Furthermore, we also comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with our MFM approach.

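Motivation and Objectives

Investigate whether masking in the frequency domain yields better self-supervised representations than masking in the spatial domain.
Develop a flexible, architecture-agnostic pre-training framework (ViT and CNN) that does not rely on mask tokens.
Compare frequency-domain corruption with conventional low-level spatial corruptions and with existing masked image modeling (MIM) methods.
Evaluate MFM on image classification and semantic segmentation, and assess its robustness on several benchmarks.
Explore the relationship between classical image restoration tasks and MFM from a unified frequency perspective.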

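Proposed Method

Convert the image to the frequency domain with an FFT, then apply a low-pass or high-pass mask to part of the frequency components using a circular mask of radius r.
Randomly choose between the low-pass-masked and the high-pass-masked input, and feed the corrupted spatial image to the encoder (ViT or CNN) without inserting mask tokens.
Use a lightweight linear decoder to reconstruct the masked frequency components on the spectrum, trained with a frequency-domain loss.
Define the reconstruction loss as a frequency distance that accounts for both amplitude and phase differences over the masked spectrum: L = mean over the masked frequencies of |F_r - F_o|^gamma, with gamma typically set to 1 (F_r and F_o appear to denote the reconstructed and original spectra).
Pre-train in a self-supervised manner on ImageNet-1K and evaluate on downstream tasks: ImageNet-1K fine-tuning and ADE20K semantic segmentation.
Show that predicting only the masked spectrum is more effective than reconstructing the full spectrum, and that a frequency-domain loss outperforms a spatial loss.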
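
The mask-and-predict steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`circular_mask`, `corrupt_in_frequency`, `masked_frequency_loss`), the single-channel 32x32 random image, and the radius of 8 are all assumptions made for illustration. It also assumes the model outputs a spatial image whose spectrum is compared with the original, and it replaces the encoder and decoder with two stand-in predictions.

```python
import numpy as np


def circular_mask(h, w, radius):
    """Boolean map that is True inside a circle centred on the shifted spectrum."""
    ys, xs = np.ogrid[:h, :w]
    cy, cx = h // 2, w // 2
    return (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2


def corrupt_in_frequency(image, radius, keep_low):
    """Mask part of the spectrum and return the corrupted spatial image.

    Returns (corrupted image, centred original spectrum, map of removed frequencies).
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # move the DC term to the centre
    inside = circular_mask(h, w, radius)
    keep = inside if keep_low else ~inside          # low-pass keeps the disc, high-pass the rest
    corrupted = np.fft.ifft2(np.fft.ifftshift(spectrum * keep)).real
    return corrupted, spectrum, ~keep


def masked_frequency_loss(pred_image, spectrum, masked, gamma=1.0):
    """Mean of |F_r - F_o|^gamma taken over the removed frequencies only."""
    pred_spectrum = np.fft.fftshift(np.fft.fft2(pred_image))
    distance = np.abs(pred_spectrum - spectrum) ** gamma  # complex difference covers amplitude and phase
    return distance[masked].mean()


# Illustrative data: a random single-channel "image" (assumption, not real training data).
rng = np.random.default_rng(0)
image = rng.random((32, 32))

# Randomly pick a low-pass or a high-pass filtered input, as the method describes.
keep_low = bool(rng.random() < 0.5)
corrupted, spectrum, masked = corrupt_in_frequency(image, radius=8, keep_low=keep_low)

# Stand-ins for a model: copying the corrupted input vs. a perfect reconstruction.
print("loss when the input is copied:", masked_frequency_loss(corrupted, spectrum, masked))
print("loss for a perfect reconstruction:", masked_frequency_loss(image, spectrum, masked))
```

Because the low-pass and high-pass masks are complements of each other, the two corrupted versions of an image sum back to the original, which is a convenient sanity check when experimenting with the radius.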

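Experimental Results

Research Questions

RQ1: Can masking in the frequency domain, without mask tokens, learn richer representations for both ViT and CNN?
RQ2: How do mask type (low-pass, high-pass, or random), radius, shape, and sampling strategy affect MFM's performance?
RQ3: How does MFM compare with low-level image restoration tasks and existing MIM methods in terms of performance and robustness?
RQ4: With architectures such as ViT and ResNet-50, can MFM achieve competitive results on ImageNet classification and ADE20K segmentation?
RQ5: How does MFM affect robustness to adversarial attacks and common corruptions across benchmarks?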

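Key Findings

After 300 epochs of pre-training on ImageNet-1K, MFM reaches 83.1% top-1 accuracy with ViT-B/16 and 81.6% with ViT-S/16 (without mask tokens).
On ADE20K, ViT-B/16 with MFM reaches 48.6 mIoU, surpassing several self-supervised methods and the supervised baseline in some settings.
MFM frequently ranks among the top methods on robustness benchmarks while maintaining strong standard accuracy (see, e.g., the robustness metrics in Table 6).
Low-pass/high-pass frequency masking, random masking, and predicting only the masked spectrum rather than reconstructing the full spectrum each help improve performance.
Viewing low-level image processing tasks (super-resolution, deblurring, denoising) from the frequency-domain perspective sheds light on their effectiveness and on how they interact with the architecture (ViT vs. CNN).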


This review was generated by AI and checked by a human editor.