QUICK REVIEW

[Paper Review] Masked Frequency Modeling for Self-Supervised Visual Pre-Training

Jiahao Xie, Wei Li|arXiv (Cornell University)|Jun 15, 2022

Image Processing Techniques and Applications | Citations: 28

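One-Sentence Summary

MFM masks frequency components in the Fourier domain and predicts the missing frequencies, learning visual representations for both ViT and CNN without mask tokens and achieving competitive performance and robustness relative to prior MIM methods.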

ABSTRACT

We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models. Instead of randomly inserting mask tokens to the input embeddings in the spatial domain, in this paper, we shift the perspective to the frequency domain. Specifically, MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that predicting masked components in the frequency domain is more ideal to reveal underlying image patterns rather than predicting masked patches in the spatial domain, due to the heavy spatial redundancy. Our findings suggest that with the right configuration of mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among low-frequency counterparts are useful in learning good representations. For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token. Experimental results on image classification and semantic segmentation, as well as several robustness benchmarks show the competitive performance and advanced robustness of MFM compared with recent masked image modeling approaches. Furthermore, we also comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with our MFM approach.

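Motivation and Objectives

Investigate whether masking in the frequency domain can yield better self-supervised representations than spatial masking.
Develop a flexible, architecture-agnostic pre-training framework (ViT and CNN) that does not rely on mask tokens.
Compare frequency-domain corruption with conventional low-level spatial corruptions and with existing masked image modeling (MIM) methods.
Evaluate MFM on image classification and semantic segmentation, and assess its robustness on multiple benchmarks.
Explore the relationship between classical image restoration tasks and MFM from a unified frequency perspective.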

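Proposed Method

Transform the image into the frequency domain with an FFT and mask out part of the frequency components using a low-pass or high-pass filter defined by a circular mask of radius r.
Randomly choose between the low-pass-masked and high-pass-masked input, and feed the corrupted spatial image to the encoder (ViT or CNN) without inserting mask tokens.
Use a lightweight linear decoder to reconstruct the masked frequencies on the frequency spectrum, trained with a frequency-domain loss.
Define the reconstruction loss as a frequency distance that combines amplitude and phase differences, averaged over the masked part of the spectrum: L = mean over the masked spectrum of |F_r - F_o|^gamma, where gamma is usually 1.
Pre-train in a self-supervised manner on ImageNet-1K, and evaluate on downstream tasks via ImageNet-1K fine-tuning and ADE20K semantic segmentation.
Show that predicting only the masked part of the spectrum is more effective than reconstructing the full spectrum, and that a frequency-domain loss outperforms a spatial-domain loss.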
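
The masking and loss steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single-channel image, omits the encoder and linear decoder entirely, and reads F_r and F_o as the reconstructed and original spectra. The function names (`circular_low_pass_mask`, `corrupt_in_frequency_domain`, `masked_frequency_loss`) and the 16x16 random test image are hypothetical choices made for this example.

```python
import numpy as np


def circular_low_pass_mask(height, width, radius):
    """Boolean mask that is True inside a centred circle of the given radius."""
    ys, xs = np.ogrid[:height, :width]
    cy, cx = height // 2, width // 2
    return (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2


def corrupt_in_frequency_domain(image, radius, keep_low):
    """Mask part of the spectrum and return the corrupted spatial image.

    image: 2-D float array (single channel, an assumption for brevity).
    Returns (corrupted_image, original_spectrum, masked_out), where
    masked_out marks the frequencies that were removed.
    """
    # Shift the zero frequency to the centre so a circle selects low frequencies.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    low = circular_low_pass_mask(*image.shape, radius)
    keep = low if keep_low else ~low
    # Back to the spatial domain; .real drops the tiny numerical imaginary part.
    corrupted = np.fft.ifft2(np.fft.ifftshift(spectrum * keep)).real
    return corrupted, spectrum, ~keep


def masked_frequency_loss(pred_spectrum, target_spectrum, masked_out, gamma=1.0):
    """Mean of |F_r - F_o|^gamma over the masked frequencies only.

    The complex difference reflects both amplitude and phase errors.
    """
    diff = np.abs(pred_spectrum - target_spectrum) ** gamma
    return diff[masked_out].mean()


rng = np.random.default_rng(0)
image = rng.random((16, 16))

# Low-pass input (high frequencies removed) and high-pass input (low removed).
corrupted_low, spectrum, masked_high_freqs = corrupt_in_frequency_domain(
    image, radius=4, keep_low=True
)
corrupted_high, _, masked_low_freqs = corrupt_in_frequency_domain(
    image, radius=4, keep_low=False
)

# Stand-in "prediction": the spectrum of the corrupted image itself.
# A trained encoder plus decoder would produce this in the real method.
pred = np.fft.fftshift(np.fft.fft2(corrupted_low))
loss = masked_frequency_loss(pred, spectrum, masked_high_freqs)
```

During pre-training, one of the two corrupted inputs would be picked at random for each image, for example with `keep_low = rng.random() < 0.5`. Because the two masks are complementary, the low-pass and high-pass images sum back to the original, which is a convenient sanity check on the masking code.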

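Experimental Results

Research Questions

RQ1: Can frequency-domain masking learn richer representations for ViT and CNN without using mask tokens?
RQ2: How do mask type (low-pass/high-pass/random), radius, shape, and sampling affect MFM performance?
RQ3: How does MFM compare with low-level image restoration tasks and existing MIM methods in terms of performance and robustness?
RQ4: Can MFM achieve competitive results on ImageNet classification and ADE20K segmentation with architectures such as ViT and ResNet-50?
RQ5: How does MFM affect robustness to adversarial attacks and common corruptions across benchmarks?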

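Key Findings

After 300 epochs of pre-training on ImageNet-1K, MFM achieves 83.1% top-1 accuracy with ViT-B/16 and 81.6% with ViT-S/16, without mask tokens.
On ADE20K, ViT-B/16 with MFM reaches 48.6 mIoU, outperforming self-supervised methods and supervised baselines in some settings.
MFM often ranks among the top methods on robustness benchmarks while maintaining high standard accuracy (e.g., the robustness metrics in Table 6).
Using low-pass/high-pass masks with random sampling and predicting only the masked part of the spectrum performs better than reconstructing the full spectrum.
Compared with low-level image processing tasks (super-resolution, deblurring, denoising), the frequency-domain perspective reveals how their effectiveness interacts with the architecture (ViT vs. CNN).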

This review was generated by AI and checked by a human editor.