QUICK REVIEW

[Paper Review] From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding

Wenzhao Xiang, Yue Wu|arXiv (Cornell University)|Mar 10, 2026

Domain Adaptation and Few-Shot Learning | Citations: 0

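One-line summary

C2FMAE is a coarse-to-fine masked autoencoder that learns hierarchical visual representations across semantic masks, instance masks, and RGB images. It introduces a cascaded decoder and progressive masking, and reports strong gains on classification, detection, and segmentation.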

ABSTRACT

Self-supervised visual pre-training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine-grained detail, while masked image modeling (MIM) preserves local textures but suffers from "attention drift" due to semantically-agnostic random masking. We propose C2FMAE, a coarse-to-fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene-level), instance masks (object-level), and RGB images (pixel-level). Two synergistic innovations enforce a strict top-down learning principle. First, a cascaded decoder sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic-guided to instance-guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we construct a large-scale multi-granular dataset with high-quality pseudo-labels for all 1.28M ImageNet-1K images. Extensive experiments show that C2FMAE achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.
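The progressive masking curriculum described in the abstract can be pictured as a set of mixing weights that move from semantic-guided to instance-guided to random masking as training proceeds. The sketch below is only an illustration. The piecewise-linear schedule, the names `curriculum_weights` and `sample_strategy`, and the idea of drawing one strategy per training step are all assumptions, not the schedule used by the authors.

```python
import random
from typing import Tuple

STRATEGIES = ("semantic", "instance", "random")


def curriculum_weights(progress: float) -> Tuple[float, float, float]:
    """Hypothetical (semantic, instance, random) weights for progress in [0, 1]."""
    t = min(max(progress, 0.0), 1.0)       # clamp training progress to [0, 1]
    semantic = max(0.0, 1.0 - 2.0 * t)     # fades out over the first half
    random_w = max(0.0, 2.0 * t - 1.0)     # fades in over the second half
    instance = 1.0 - semantic - random_w   # peaks at the midpoint
    return semantic, instance, random_w


def sample_strategy(progress: float, rng: random.Random) -> str:
    """Draw one masking strategy according to the current weights (assumed scheme)."""
    return rng.choices(STRATEGIES, weights=curriculum_weights(progress), k=1)[0]


rng = random.Random(0)
for step, total in [(0, 100), (25, 100), (50, 100), (100, 100)]:
    progress = step / total
    print(step, curriculum_weights(progress), sample_strategy(progress, rng))
```

Under this assumed schedule, training starts with purely semantic-guided masking, passes through purely instance-guided masking at the midpoint, and ends with purely random masking, which matches the global-to-local ordering the abstract describes.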

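Motivation and objectives

Motivates the need in self-supervised pre-training to unify global semantic understanding with fine-grained visual detail.
Proposes a hierarchical pre-training framework that exploits three data granularities: semantic masks, instance masks, and RGB images.
Enforces a top-down learning principle through a cascaded decoder and a progressive masking curriculum.
Shows that hierarchical pre-training yields robust representations across classification, detection, and segmentation tasks.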

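Proposed method

Feeds three granularities of input (RGB, instance masks, semantic masks) into a shared ViT encoder.
Uses a cascaded decoder that reconstructs semantic masks, instance masks, and RGB images in sequence to enforce coarse-to-fine refinement.
Implements a top-down progressive masking strategy with semantic-guided, instance-guided, and random masking phases, shifting the focus with adaptive weights during training.
Builds a large-scale multi-granular dataset by generating aligned instance and semantic segmentation pseudo-labels for ImageNet-1K (1.28M images).
Trains with a multi-task objective that combines the semantic, instance, and RGB reconstruction losses, balanced by the weights λ_S, λ_I, and λ_R.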
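To make the data flow concrete, the following minimal sketch shows how a cascaded decoder differs from parallel decoders (each stage receives the previous stage's output) and how the three reconstruction losses might be combined with λ_S, λ_I, and λ_R. The function names, the toy decoders, the choice to pass only the immediately preceding output to each stage, and the use of mean squared error for every granularity are assumptions made for illustration; the review does not specify the actual loss functions or decoder interfaces.

```python
from typing import Callable, Dict, List

Vector = List[float]


def cascaded_decode(
    latent: Vector,
    decode_semantic: Callable[[Vector], Vector],
    decode_instance: Callable[[Vector, Vector], Vector],
    decode_rgb: Callable[[Vector, Vector], Vector],
) -> Dict[str, Vector]:
    """Run the three stages in order; each later stage sees the previous output."""
    semantic = decode_semantic(latent)            # scene level
    instance = decode_instance(latent, semantic)  # object level, conditioned on semantics
    rgb = decode_rgb(latent, instance)            # pixel level, conditioned on instances
    return {"semantic": semantic, "instance": instance, "rgb": rgb}


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error, a placeholder for each reconstruction loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(
    preds: Dict[str, Vector],
    targets: Dict[str, Vector],
    lambda_s: float = 1.0,
    lambda_i: float = 1.0,
    lambda_r: float = 1.0,
) -> float:
    """Weighted sum: lambda_S * L_semantic + lambda_I * L_instance + lambda_R * L_rgb."""
    return (
        lambda_s * mse(preds["semantic"], targets["semantic"])
        + lambda_i * mse(preds["instance"], targets["instance"])
        + lambda_r * mse(preds["rgb"], targets["rgb"])
    )


# Toy stand-ins for learned decoders (hypothetical, for demonstration only).
toy_preds = cascaded_decode(
    [1.0, 2.0],
    decode_semantic=lambda z: [0.5 * v for v in z],
    decode_instance=lambda z, s: [a + b for a, b in zip(z, s)],
    decode_rgb=lambda z, i: [a - b for a, b in zip(z, i)],
)
toy_targets = {"semantic": [0.5, 1.0], "instance": [1.5, 2.0], "rgb": [-0.5, -1.0]}
print(toy_preds)
print(total_loss(toy_preds, toy_targets, lambda_i=2.0))
```

In a parallel design, `decode_instance` and `decode_rgb` would take only `latent`; the extra argument is what carries the cross-granularity dependency in this sketch.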

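Experimental results

Research questions

RQ1: Can a coarse-to-fine pre-training framework unify high-level semantics with fine-grained detail and improve downstream tasks?
RQ2: Does the cascaded decoder enforce hierarchical information flow better than parallel multimodal decoders?
RQ3: Does progressive masking aligned with the hierarchical objective reduce attention drift and improve representation quality?
RQ4: How does C2FMAE perform relative to MAE and MultiMAE on image classification, object detection, and semantic segmentation?
RQ5: How much does the multi-granular pre-training data affect downstream performance?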

Key findings

Method	Model	Modality	Masking	PT Epochs	PT Cost	Acc. (%)
Scratch	ViT-B	-	-	-	-	82.3
MoCo v3	ViT-B	RGB	-	300	-	83.2
DINO	ViT-B	RGB	-	300	-	82.8
BEiT	ViT-B	RGB	Random	800	~7.0x	83.2
MAE	ViT-B	RGB	Random	400	~1.0x	82.9
MAE	ViT-B	RGB	Random	1600	~4.0x	83.6
iBOT	ViT-B	RGB	Random	1600	~5.7x	84.0
UnMAE	ViT-B	RGB	Uniform	200	-	82.9
CAE	ViT-B	RGB	Random	800	~4.6x	83.6
MaskFeat	ViT-B	RGB	Random	1600	~20.1x	84.0
SemMAE	ViT-B	RGB	Semantic	800	-	83.3
AutoMAE	ViT-B	RGB	Semantic	800	-	83.3
ConMIM	ViT-B	RGB	Random	800	~4.4x	83.7
MIRL	ViT-B	RGB	Random	800	-	84.1
ROPIM	ViT-B	RGB	Random	800	~10.4x	84.0
MFM	ViT-B	RGB/Frequency	Random	300	~1.1x	83.1
MultiMAE*	ViT-B	RGB/Dep./Sem.	Random	400	~1.3x	82.7
MultiMAE	ViT-B	RGB/Dep./Sem.	Random	1600	~5.2x	83.3
C2FMAE	ViT-B	RGB/Inst./Sem.	Progressive	400	~1.3x	83.7
C2FMAE	ViT-B	RGB/Inst./Sem.	Progressive	1600	~5.2x	84.2
C2FMAE†	ViT-B	RGB/Inst./Sem.	Progressive	1600	~5.2x	84.4

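On ImageNet-1K, C2FMAE reaches 83.7% and 84.2% fine-tuning accuracy after 400 and 1600 pre-training epochs respectively, outperforming MAE and MultiMAE.
On COCO object detection and instance segmentation, it improves over MAE by +1.8 APb / +1.6 APm and over MultiMAE by +2.0 APb / +1.9 APm.
On ADE20K semantic segmentation, C2FMAE reaches 49.1% mIoU, 1.0% above MAE and 1.3% above MultiMAE.
The 400-epoch C2FMAE is more accurate than the 1600-epoch MAE (83.7% vs. 83.6%), at a training cost roughly equal to that of MultiMAE and about 1.3x that of the 400-epoch MAE (the ~1.0x reference in the table).
With RGB/Inst./Sem. inputs, C2FMAE shows advantages in cross-task robustness and hierarchical representation learning.
The (partial) ablation results suggest that performance improves step by step over the MultiMAE baseline as the dataset and architectural components are added.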
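The ImageNet-1K comparison in the findings can be checked directly against the table. The short script below copies the accuracy values for MAE, MultiMAE, and C2FMAE from the table and computes the gaps in percentage points; the dictionary layout and the helper name `gain` are illustrative choices, and the 400-epoch MultiMAE row is the one marked with an asterisk in the table.

```python
# Fine-tuning accuracy (%) copied from the table, keyed by (method, pre-training epochs).
ACC = {
    ("MAE", 400): 82.9,
    ("MAE", 1600): 83.6,
    ("MultiMAE", 400): 82.7,   # listed as MultiMAE* in the table
    ("MultiMAE", 1600): 83.3,
    ("C2FMAE", 400): 83.7,
    ("C2FMAE", 1600): 84.2,
}


def gain(baseline: str, baseline_epochs: int, c2fmae_epochs: int) -> float:
    """Accuracy gap (percentage points) of C2FMAE over a baseline row."""
    return round(ACC[("C2FMAE", c2fmae_epochs)] - ACC[(baseline, baseline_epochs)], 1)


for baseline in ("MAE", "MultiMAE"):
    for epochs in (400, 1600):
        print(f"C2FMAE vs {baseline} at {epochs} epochs: +{gain(baseline, epochs, epochs)}")

# Cross-budget comparison: 400-epoch C2FMAE against 1600-epoch MAE.
print("C2FMAE (400) vs MAE (1600):", gain("MAE", 1600, 400))
```

At matched epoch counts, the gaps over MultiMAE are larger than the gaps over MAE, and the 400-epoch C2FMAE still holds a small margin over the 1600-epoch MAE.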


This review was generated by AI and checked by a human editor.