QUICK REVIEW

[Paper Review] PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments

Guoliang Zhu, Wanjun Jia | arXiv (Cornell University) | Mar 10, 2026

Multimodal Machine Learning Applications | Cited by: 0

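One-Sentence Summary

PanoAffordanceNet introduces a holistic affordance grounding framework for 360° indoor environments, built on distortion-aware modulation and spherical densification, and proposes 360-AGD, a panoramic dataset for evaluation. It achieves state-of-the-art one-shot grounding in panoramic scenes and generalizes to perspective views.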

ABSTRACT

Global perception is essential for embodied agents in 360° spaces, yet current affordance grounding remains largely object-centric and restricted to perspective views. To bridge this gap, we introduce a novel task: Holistic Affordance Grounding in 360° Indoor Environments. This task faces unique challenges, including severe geometric distortions from Equirectangular Projection (ERP), semantic dispersion, and cross-scale alignment difficulties. We propose PanoAffordanceNet, an end-to-end framework featuring a Distortion-Aware Spectral Modulator (DASM) for latitude-dependent calibration and an Omni-Spherical Densification Head (OSDH) to restore topological continuity from sparse activations. By integrating multi-level constraints comprising pixel-wise, distributional, and region-text contrastive objectives, our framework effectively suppresses semantic drift under low supervision. Furthermore, we construct 360-AGD, the first high-quality panoramic affordance grounding dataset. Extensive experiments demonstrate that PanoAffordanceNet significantly outperforms existing methods, establishing a solid baseline for scene-level perception in embodied intelligence. The source code and benchmark dataset will be made publicly available at https://github.com/GL-ZHU925/PanoAffordanceNet.

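Motivation and Objectives

Shift affordance grounding from object-centric, perspective-view settings to holistic, scene-level reasoning in 360° indoor environments, where ERP introduces geometric distortion.
Introduce specialized modules and multi-level supervision to handle ERP distortion, sparse affordance regions, and semantic drift.
Provide a high-quality panoramic affordance grounding dataset that standardizes evaluation.
Demonstrate the robustness and generalization of the proposed method in both the panoramic and perspective-view domains.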
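
To make the ERP distortion mentioned above concrete, the following is a minimal sketch of the standard equirectangular geometry: each image row sits at a fixed latitude, a pixel in that row covers a solid angle proportional to the cosine of the latitude, and content is stretched horizontally by the reciprocal of that factor. The function names are hypothetical and are not taken from the paper; this illustrates the distortion itself, not how DASM corrects it.

```python
import numpy as np

def erp_row_latitudes(height):
    """Latitude in radians at the center of each ERP row, from top (near +pi/2) to bottom (near -pi/2)."""
    rows = np.arange(height) + 0.5
    return np.pi / 2 - rows / height * np.pi

def erp_area_weights(height):
    """Relative solid angle covered by one pixel in each row, proportional to cos(latitude)."""
    return np.cos(erp_row_latitudes(height))

def erp_horizontal_stretch(height):
    """Factor by which a small patch is widened in ERP relative to the sphere: 1 / cos(latitude)."""
    return 1.0 / erp_area_weights(height)

# Example for a hypothetical 8-row panorama.
weights = erp_area_weights(8)
stretch = erp_horizontal_stretch(8)
```

Rows near the poles receive small area weights and large stretch factors, which is why a latitude-dependent treatment is a natural fit for panoramic inputs.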

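Proposed Method

Multimodal grounding via dual-encoder feature extraction with LoRA-based adaptation.
Distortion-Aware Spectral Modulator (DASM), which performs latitude-adaptive dual-frequency spectral distillation.
A sphere-aware hierarchical decoder and the Omni-Spherical Densification Head (OSDH), which densify sparse activations on the sphere.
A multi-level training objective combining pixel-level, distribution-level (KL), and region-text contrastive (InfoNCE) losses.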
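
The multi-level objective in the last item can be sketched as a weighted sum of three terms. This is a minimal NumPy illustration under assumptions: the review does not give the loss weights, the temperature, or the exact form of each term, so `w_pix`, `w_kl`, `w_rtc`, `temperature`, and all function names are hypothetical. Binary cross-entropy is used for the pixel-level term because the findings refer to it as BCE.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Pixel-level term: binary cross-entropy between predicted and target heatmaps in [0, 1]."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def kl_loss(pred, target, eps=1e-7):
    """Distribution-level term: KL(target || pred) after normalizing both maps to sum to 1."""
    p = target / (target.sum() + eps)
    q = pred / (pred.sum() + eps)
    return float(np.sum(p * np.log(eps + p / (q + eps))))

def info_nce(region_emb, text_emb, temperature=0.07):
    """Region-text contrastive term: row i of region_emb is the positive for row i of text_emb."""
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = r @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def total_loss(pred, target, region_emb, text_emb, w_pix=1.0, w_kl=1.0, w_rtc=1.0):
    """Weighted sum of the three terms; the weights are assumed, not reported in the review."""
    return (w_pix * bce_loss(pred, target)
            + w_kl * kl_loss(pred, target)
            + w_rtc * info_nce(region_emb, text_emb))
```

Each term targets a different failure mode: the pixel term fits local values, the KL term matches the overall shape of the heatmap, and the contrastive term ties each region to its own text description rather than to the others.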

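Experimental Results

Research Questions

RQ1: How can affordances be grounded holistically in 360° indoor environments despite ERP distortion and sparse regions?
RQ2: Can distortion-aware modulation and spherical densification recover topologically continuous affordance regions from sparse activations?
RQ3: Does integrating pixel-level, distribution-level, and region-text supervision improve grounding accuracy and reduce semantic drift?
RQ4: How well does the proposed method perform on panoramic data, and does it generalize to perspective-view datasets?
RQ5: How good and how useful is the panoramic affordance grounding benchmark (360-AGD) for this task?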

Key Findings

Method	Supervision	Easy Split KLD (lower is better)	Easy Split SIM (higher is better)	Easy Split NSS (higher is better)	Hard Split KLD (lower is better)	Hard Split SIM (higher is better)	Hard Split NSS (higher is better)
OOAL	One-shot	2.868	0.117	1.267	3.067	0.097	1.484
OS-AGDO	One-shot	2.853	0.124	1.299	2.965	0.115	1.484
Ours	One-shot	1.270	0.506	4.490	1.306	0.474	4.398
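
For readers unfamiliar with the three metrics in the table, the sketch below shows how KLD, SIM, and NSS are commonly computed for heatmap predictions. The review does not describe the paper's exact evaluation protocol, so the normalization choices, the `fixation_threshold` value, and the function names here are assumptions.

```python
import numpy as np

def _to_distribution(x, eps=1e-12):
    """Normalize a non-negative map so that its values sum to 1."""
    x = np.asarray(x, dtype=np.float64)
    return x / (x.sum() + eps)

def kld(pred, gt, eps=1e-12):
    """KL divergence from the ground-truth distribution to the prediction (lower is better)."""
    p, g = _to_distribution(pred), _to_distribution(gt)
    return float(np.sum(g * np.log(eps + g / (p + eps))))

def sim(pred, gt):
    """Histogram intersection of the two normalized maps, in [0, 1] (higher is better)."""
    return float(np.minimum(_to_distribution(pred), _to_distribution(gt)).sum())

def nss(pred, gt, fixation_threshold=0.1, eps=1e-12):
    """Mean z-scored prediction over ground-truth locations above an assumed threshold (higher is better)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    z = (pred - pred.mean()) / (pred.std() + eps)
    fixations = gt / (gt.max() + eps) > fixation_threshold
    if not fixations.any():
        return float("nan")
    return float(z[fixations].mean())

# Toy example: a 4x4 ground truth with a 2x2 active region, scored against itself and a flat map.
gt_map = np.zeros((4, 4))
gt_map[1:3, 1:3] = 1.0
flat_map = np.ones((4, 4))
```

A perfect prediction drives KLD toward 0 and SIM toward 1, while a flat prediction gets no credit under NSS because it has no contrast at the annotated locations.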

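On 360-AGD, PanoAffordanceNet substantially outperforms the two one-shot baselines, OOAL and OS-AGDO, on KLD, SIM, and NSS across both the Easy and Hard splits (for example, an Easy-split KLD of 1.270 versus 2.853 for OS-AGDO).
Ablations show that LoRA, DASM, and OSDH each contribute to the results, with the full model achieving the best KLD and SIM.
The multi-level losses (BCE, KL, RTC) jointly improve pixel accuracy, distributional consistency, and region-text alignment, and combining all three yields the strongest overall metrics.
The method also remains competitive on the perspective-view AGD20K dataset, indicating cross-domain robustness.
360-AGD provides a new benchmark of panoramic scenes with 19 affordance classes and multi-region annotations.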

This review was generated by AI and checked by a human editor.