QUICK REVIEW

[Paper Review] PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling

Yuan Liu, Songyang Zhang|arXiv (Cornell University)|Mar 4, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 13

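ONE-SENTENCE SUMMARY

PixMIM analyzes the bottlenecks of pixel-based MIM and introduces a simple plug-and-play method that de-emphasizes high-frequency texture in the reconstruction target and preserves foreground information, improving MAE, ConvMAE, and LSMAE at minimal cost.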

ABSTRACT

Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT. However, subsequent works have complicated the framework with new auxiliary tasks or extra pre-trained models, inevitably increasing computational overhead. This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction, which examines the input image patches and reconstruction target, and highlights two critical but previously overlooked bottlenecks. Based on this analysis, we propose a remarkably simple and effective method, PixMIM, that entails two strategies: 1) filtering the high-frequency components from the reconstruction target to de-emphasize the network's focus on texture-rich details and 2) adopting a conservative data transform strategy to alleviate the problem of missing foreground in MIM training. PixMIM can be easily integrated into most existing pixel-based MIM approaches (i.e., using raw images as reconstruction target) with negligible additional computation. Without bells and whistles, our method consistently improves three MIM approaches, MAE, ConvMAE, and LSMAE, across various downstream tasks. We believe this effective plug-and-play method will serve as a strong baseline for self-supervised learning and provide insights for future improvements of the MIM framework. Code and models are available at https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/configs/selfsup/pixmim.

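MOTIVATION AND OBJECTIVES

Identify the bottlenecks in pixel-based Masked Image Modeling (MIM) that stem from the reconstruction target and the input patches.
Propose a simple plug-and-play method that improves existing MIM approaches with negligible additional computation.
Show that PixMIM generalizes across multiple MIM baselines and downstream tasks.
Demonstrate improved robustness and shape bias under PixMIM.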

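PROPOSED METHOD

Analyze MAE-style pixel-based MIM with respect to the reconstruction target and the characteristics of the input patches.
Propose two strategies: (1) create a low-frequency reconstruction target by applying a low-pass filter in the frequency domain; (2) replace Random Resized Crop (RRC) with Simple Resized Crop (SRC) to preserve foreground content.
Provide an efficient implementation that uses RGB targets and FFT-based low-pass filtering without adding meaningful training overhead.
Demonstrate plug-and-play compatibility by applying PixMIM to MAE, ConvMAE, and LSMAE and evaluating on ImageNet-1K, ADE20K, and COCO.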
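
The first strategy above, a low-frequency reconstruction target obtained by FFT-based low-pass filtering, can be sketched roughly as follows. This is a minimal NumPy illustration and not the authors' implementation: the function name `low_pass_target`, the ideal circular mask, and the per-channel filtering are assumptions made for the example; only the use of an FFT-based low-pass filter on RGB targets and the radius value r=40 come from the summary.

```python
import numpy as np


def low_pass_target(image, radius):
    """Return a low-frequency version of `image` (H, W, C float array).

    Hypothetical helper: keeps only the frequencies within `radius` of the
    DC bin of the centered spectrum (an ideal circular mask, assumed here).
    """
    h, w = image.shape[:2]
    # Distance of every frequency bin from the DC bin after fftshift.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    mask = (dist <= radius)[..., None]  # broadcast over channels

    # Per-channel 2D FFT, with the zero frequency shifted to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    filtered = np.fft.ifft2(
        np.fft.ifftshift(spectrum * mask, axes=(0, 1)), axes=(0, 1)
    )
    # The imaginary part is ~0 when the mask is symmetric about DC.
    return filtered.real


# Usage: filter a random 224x224 RGB image with the r=40 setting.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
target = low_pass_target(image, radius=40)
```

In a pixel-based MIM pipeline, `target` would stand in for the raw image as the regression target of the masked patches, while the encoder input stays unfiltered. The filter involves no learned parameters, which is consistent with the claim of negligible extra computation.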

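EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: What are the fundamental bottlenecks in pixel-based MIM related to the reconstruction target and the input patches?
RQ2: Can simple, essentially cost-free changes to the target and the augmentation improve representation quality across multiple MIM baselines?
RQ3: Do low-frequency targets and a more conservative augmentation improve robustness, shape bias, and downstream task performance?
RQ4: Is PixMIM broadly beneficial across ImageNet classification, semantic segmentation, and object detection datasets?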

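KEY FINDINGS

Two main bottlenecks of pixel-based MIM are identified: the reconstruction target emphasizes high-frequency texture, and the input patches often have low foreground coverage under aggressive masking.
PixMIM consistently improves MAE, ConvMAE, and LSMAE on ImageNet linear probing and fine-tuning, COCO object detection, and ADE20K segmentation, with almost no additional computation.
A low-frequency reconstruction target biases learning toward shape and global patterns, improving robustness and shape bias.
Replacing Random Resized Crop (RRC) with Simple Resized Crop (SRC) raises foreground coverage during training, which helps representation learning.
PixMIM also increases robustness to distribution shift (ImageNet variants) and improves shape-bias metrics relative to the baselines.
Ablations show that a low-pass filter bandwidth near r=40 is optimal and confirm the benefit of combining both PixMIM components.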
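
To make the foreground-coverage finding more concrete, here is a rough sketch of how a Simple Resized Crop might pick its crop region. The summary does not define SRC, so this assumes the common description (resize the shorter edge to the target size, then take a random square crop) and omits any padding step; the function names and parameters are hypothetical.

```python
import random


def simple_resized_crop_box(width, height, size, rng=random):
    """Return (new_w, new_h, left, top) for an assumed SRC.

    The image is resized so its shorter edge equals `size`, then a
    `size` x `size` square is cropped at a random offset.
    """
    scale = size / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left = rng.randint(0, new_w - size)  # inclusive bounds
    top = rng.randint(0, new_h - size)
    return new_w, new_h, left, top


def covered_fraction(new_w, new_h, size):
    """Fraction of the resized image area that the crop retains."""
    return (size * size) / (new_w * new_h)


# Usage: a 640x480 image cropped to 224x224.
new_w, new_h, left, top = simple_resized_crop_box(640, 480, 224, random.Random(0))
fraction = covered_fraction(new_w, new_h, 224)
```

Under this assumption the crop always spans the full shorter edge, so for a 640x480 image it retains about three quarters of the image area. A Random Resized Crop, by contrast, samples the crop area as a fraction of the image and can retain far less, which is one plausible reading of why SRC keeps more of the foreground visible before masking.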


This review was generated by AI and reviewed by human editors.