QUICK REVIEW

[Paper Review] OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms

Zhongyuang Liu, Min He|arXiv (Cornell University)|Mar 18, 2026

Multimodal Machine Learning Applications | Citations: 0

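One-Line Summary

OmniVLN integrates omnidirectional 360° sensing with a hierarchical 3D scene graph and token-efficient LLM prompting to perform visual-language navigation on both aerial and ground robots in real-world indoor environments.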

ABSTRACT

Language-guided embodied navigation requires an agent to interpret object-referential instructions, search across multiple rooms, localize the referenced target, and execute reliable motion toward it. Existing systems remain limited in real indoor environments because narrow field-of-view sensing exposes only a partial local scene at each step, often forcing repeated rotations, delaying target discovery, and producing fragmented spatial understanding; meanwhile, directly prompting LLMs with dense 3D maps or exhaustive object lists quickly exceeds the context budget. We present OmniVLN, a zero-shot visual-language navigation framework that couples omnidirectional 3D perception with token-efficient hierarchical reasoning for both aerial and ground robots. OmniVLN fuses a rotating LiDAR and panoramic vision into a hardware-agnostic mapping stack, incrementally constructs a five-layer Dynamic Scene Graph (DSG) from mesh geometry to room- and building-level structure, and stabilizes high-level topology through persistent-homology-based room partitioning and hybrid geometric/VLM relation verification. For navigation, the global DSG is transformed into an agent-centric 3D octant representation with multi-resolution spatial attention prompting, enabling the LLM to progressively filter candidate rooms, infer egocentric orientation, localize target objects, and emit executable navigation primitives while preserving fine local detail and compact long-range memory. Experiments show that the proposed hierarchical interface improves spatial referring accuracy from 77.27% to 93.18%, reduces cumulative prompt tokens by up to 61.7% in cluttered multi-room settings, and improves navigation success by up to 11.68% over a flat-list baseline. We will release the code and an omnidirectional multimodal dataset to support reproducible research.

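Motivation and Objectives

Address the need for robust language-guided embodied navigation in multi-room indoor environments.
Overcome the limits of narrow field-of-view sensing by integrating omnidirectional perception into a consistent 3D map.
Compress dense omnidirectional maps into a compact hierarchical representation for scalable LLM reasoning.
Achieve zero-shot navigation across heterogeneous platforms through a unified perception-to-action stack.
Provide an open dataset and an open-source framework to support reproducible research.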

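Proposed Method

Fuse a rotating LiDAR with panoramic vision to build a 360° semantic 3D map.
Incrementally construct a five-layer Dynamic Scene Graph (DSG) spanning mesh geometry to building-level structure.
Transform the global DSG into an agent-centric 3D octant representation with multi-resolution prompting.
Generate executable navigation primitives using hierarchical chain-of-thought prompting and an Actor-Critic loop.
Verify edges and object relations through geometric pruning and VLM verification to keep the graph faithful to the scene.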
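
The octant and multi-resolution steps above can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the octant numbering, the x-forward / y-left / z-up agent frame, the 3 m `near_radius`, and the names `octant_index`, `to_agent_frame`, and `build_prompt` are all hypothetical. Nearby objects are listed individually and distant ones are collapsed to per-octant counts, which is one simple way to keep local detail while shortening the prompt.

```python
import math
from collections import Counter


def octant_index(dx, dy, dz):
    """Hypothetical numbering: bit 0 = x >= 0, bit 1 = y >= 0, bit 2 = z >= 0."""
    return (
        (1 if dx >= 0 else 0)
        | ((1 if dy >= 0 else 0) << 1)
        | ((1 if dz >= 0 else 0) << 2)
    )


def to_agent_frame(point, agent_pos, agent_yaw):
    """Shift a world point to the agent origin, then rotate by -yaw about z."""
    px, py, pz = (point[i] - agent_pos[i] for i in range(3))
    c, s = math.cos(-agent_yaw), math.sin(-agent_yaw)
    return (c * px - s * py, s * px + c * py, pz)


def build_prompt(objects, agent_pos, agent_yaw, near_radius=3.0):
    """List near objects one by one; summarize far objects as counts per octant.

    `near_radius` is an assumed threshold, not a value from the paper.
    """
    near_lines = []
    far_counts = Counter()
    for label, pos in objects:
        dx, dy, dz = to_agent_frame(pos, agent_pos, agent_yaw)
        dist = math.sqrt(dx * dx + dy * dy + dz * dz)
        octant = octant_index(dx, dy, dz)
        if dist <= near_radius:
            near_lines.append(f"octant {octant}: {label} at {dist:.1f} m")
        else:
            far_counts[octant] += 1
    far_lines = [
        f"octant {o}: {n} distant object(s)" for o, n in sorted(far_counts.items())
    ]
    return "\n".join(near_lines + far_lines)


# Made-up scene: one nearby chair and two distant objects behind and to the left.
scene = [
    ("chair", (1.0, 1.0, 0.5)),
    ("lamp", (-10.0, 2.0, 1.0)),
    ("sofa", (-8.0, 3.0, 2.0)),
]
print(build_prompt(scene, agent_pos=(0.0, 0.0, 0.0), agent_yaw=0.0))
# octant 7: chair at 1.5 m
# octant 6: 2 distant object(s)
```

In this toy scene the prompt shrinks from three object entries to two lines; the saving grows with the number of distant objects, since each far octant contributes at most one line.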

Figure 1: OmniVLN, a zero-shot visual-language navigation (VLN) framework coupling 360° 3D perception with token-efficient reasoning across aerial & ground platforms.

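Experimental Results

Research Questions

RQ1: How does omnidirectional 3D perception affect spatial reasoning and object localization in VLN?
RQ2: Can a token-efficient hierarchical reasoning approach improve long-horizon navigation across aerial and ground platforms?
RQ3: What advantages does an agent-centric octant representation offer for LLM-based planning in complex 3D environments?
RQ4: How does persistent-homology-based room partitioning affect topological stability and partition quality?
RQ5: For zero-shot navigation in dense multi-room settings, what are the trade-offs between dense 3D maps and compact DSG representations?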

Key Findings

Representation	VI Accuracy	VD Accuracy	Overall
Non-Hierarchical (Uniform Flat)	86.36%	68.18%	77.27%
Hierarchical (Ours)	95.45%	90.91%	93.18%

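The hierarchical DSG improves spatial referring accuracy from 77.27% to 93.18%.
Token-efficient multi-resolution prompting reduces end-to-end tokens by up to 69.98% in dense multi-room environments (the abstract reports up to 61.7% for cumulative prompt tokens).
Navigation success improves by up to 11.68% over the flat-list baseline.
The rotating LiDAR produces a more complete semantic map than a fixed LiDAR (85 vs. 61 object nodes along the same path).
Multi-resolution prompting and a fast geometry engine reduce end-to-end latency per decision from 12.4 s to 3.8 s.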
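
To put the reported figures on a common footing, the short Python sketch below derives a few relative quantities from the numbers in the table and findings above. The only inputs are values quoted in this review; the variable and function names are illustrative. The review does not state how the Overall column is computed, so treating it as the unweighted mean of VI and VD is an assumption, though it does reproduce both rows.

```python
# Figures copied from the table and findings in this review.
flat = {"VI": 86.36, "VD": 68.18, "Overall": 77.27}
hier = {"VI": 95.45, "VD": 90.91, "Overall": 93.18}


def mean_of_subsets(row):
    """Unweighted mean of VI and VD accuracy (assumes equal-sized subsets)."""
    return round((row["VI"] + row["VD"]) / 2, 2)


# Absolute gain in overall spatial referring accuracy, in percentage points.
overall_gain_pp = round(hier["Overall"] - flat["Overall"], 2)

# Relative latency reduction per decision: 12.4 s -> 3.8 s.
latency_reduction_pct = round((1 - 3.8 / 12.4) * 100, 1)

# Relative increase in object nodes: rotating LiDAR (85) vs. fixed LiDAR (61).
extra_nodes_pct = round((85 - 61) / 61 * 100, 1)

print(mean_of_subsets(flat), mean_of_subsets(hier))  # 77.27 93.18
print(overall_gain_pp)        # 15.91
print(latency_reduction_pct)  # 69.4
print(extra_nodes_pct)        # 39.3
```

Read this way, the hierarchical representation gains about 15.9 percentage points overall, most of it on the VD subset, while per-decision latency falls by roughly 69% and the rotating LiDAR yields about 39% more object nodes.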

Figure 2: Overview of the proposed framework. The Multimodal Perception module (left) achieves 360° spatio-temporal consistency by fusing data from a rotating LiDAR and panoramic fisheye cameras, enabling the generation of high-fidelity semantic point clouds across robotic platforms. The Hi


This review was generated by AI and checked by a human editor.