QUICK REVIEW

[Paper Review] KLDrive: Fine-Grained 3D Scene Reasoning for Autonomous Driving based on Knowledge Graph

Ye Tian, Jingyi Zhang | arXiv (Cornell University) | Mar 22, 2026

Multimodal Machine Learning Applications | Citations: 0

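One-Line Summary

KLDrive proposes a knowledge-graph-augmented LLM reasoning framework that couples energy-based scene fact construction with an LLM agent running a constrained Plan–Execute–Observe loop, achieving state-of-the-art results on NuScenes-QA and GVQA.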

ABSTRACT

Autonomous driving requires reliable reasoning over fine-grained 3D scene facts. Fine-grained question answering over multi-modal driving observations provides a natural way to evaluate this capability, yet existing perception pipelines and driving-oriented large language model (LLM) methods still suffer from unreliable scene facts, hallucinations, opaque reasoning, and heavy reliance on task-specific training. We present KLDrive, the first knowledge-graph-augmented LLM reasoning framework for fine-grained question answering in autonomous driving. KLDrive addresses this problem through designing two tightly coupled components: an energy-based scene fact construction module that consolidates multi-source evidence into a reliable scene knowledge graph, and an LLM agent that performs fact-grounded reasoning over a constrained action space under explicit structural constraints. By combining structured prompting with few-shot in-context exemplars, the framework adapts to diverse reasoning tasks without heavy task-specific fine-tuning. Experiments on two large-scale autonomous-driving QA benchmarks show that KLDrive outperforms prior state-of-the-art methods, achieving the best overall accuracy of 65.04% on NuScenes-QA and the best SPICE score of 42.45 on GVQA. On counting, the most challenging factual reasoning task, it improves over the strongest baseline by 46.01 percentage points, demonstrating substantially reduced hallucinations and the benefit of coupling reliable scene fact construction with explicit reasoning.

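Motivation and Objectives

Motivate reliable reasoning over fine-grained 3D driving-scene facts (object identity, motion, spatial relations) as a basis for driving decisions.
Develop a knowledge-graph-augmented framework that provides interpretable, fact-grounded reasoning without heavy task-specific fine-tuning.
Achieve robust adaptation to diverse reasoning tasks through few-shot in-context learning and a constrained action space.
Reduce hallucinations by grounding LLM reasoning in a structured scene KG and explicit tool use.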

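Proposed Method

Introduces a two-stage KLDrive pipeline: (i) energy-based scene fact construction, which builds a reliable scene knowledge graph (scene KG) from multi-source evidence; and (ii) an LLM agent that reasons over the KG in a Plan–Execute–Observe loop with a constrained action space.
Pools multi-source evidence from camera and LiDAR detectors (RayDN, FocalFormer3D, IS-Fusion) into a unified candidate set using cross-source pooling and temporal recovery.
Refines the candidates with an energy-based model that jointly accounts for keep (retention) terms, pairwise interactions, attributes, and temporal/contextual support, yielding a consistent scene KG.
Builds a compact library of relation operators that encodes inter-object relations in the KG without materializing every pair.
Uses an LLM planner with few-shot in-context learning to decompose each question into executable operations over a bounded scene query algebra (Resolve, RelSelect, Intersect, Count, Exists, GetType, GetStatus, SameStatus).
Runs the LLM in a Plan–Execute–Observe loop to obtain auditable reasoning traces grounded in scene facts.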
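
To make the bounded query algebra more concrete, the following is a minimal Python sketch of how a question such as "How many cars are in front of the ego vehicle?" might decompose into the listed operators. Only the operator names come from the review. The `Obj` fields, the ego-frame sign conventions, the `$i` step-reference syntax, and each operator's exact semantics are assumptions made for illustration, and the plan is hard-coded here where KLDrive would use an LLM planner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Hypothetical scene-KG node; the field names are assumptions."""
    oid: int
    category: str  # e.g. "car", "pedestrian"
    status: str    # e.g. "moving", "parked"
    x: float       # forward distance from ego in metres (assumed frame)
    y: float       # lateral offset, left positive (assumed frame)


class SceneKG:
    def __init__(self, objects):
        self.objects = list(objects)

    def resolve(self, category=None, status=None):
        """Resolve: objects matching a category and/or status."""
        return {o for o in self.objects
                if (category is None or o.category == category)
                and (status is None or o.status == status)}

    def rel_select(self, relation):
        """RelSelect: objects in a relation to ego, computed on demand
        instead of storing every pairwise relation in the graph."""
        tests = {"front": lambda o: o.x > 0, "back": lambda o: o.x < 0,
                 "left": lambda o: o.y > 0, "right": lambda o: o.y < 0}
        return {o for o in self.objects if tests[relation](o)}


def run_plan(kg, plan):
    """Execute a plan step by step and record an auditable trace.

    Each step is (operator, args); an arg like "$0" refers to the
    output of step 0. Unknown operators are rejected, which keeps the
    action space bounded.
    """
    ops = {
        "Resolve": kg.resolve,
        "RelSelect": kg.rel_select,
        "Intersect": lambda a, b: a & b,
        "Count": len,
        "Exists": lambda s: len(s) > 0,
        "GetType": lambda s: sorted({o.category for o in s}),
        "GetStatus": lambda s: sorted({o.status for o in s}),
        "SameStatus": lambda a, b: {o.status for o in a} == {o.status for o in b},
    }
    results, trace = [], []
    for op, args in plan:
        if op not in ops:
            raise ValueError(f"operator not in the query algebra: {op}")
        bound = [results[int(a[1:])] if isinstance(a, str) and a.startswith("$")
                 else a for a in args]
        out = ops[op](*bound)
        results.append(out)
        # Observation: log object ids for sets, raw values otherwise.
        shown = sorted(o.oid for o in out) if isinstance(out, set) else out
        trace.append((op, args, shown))
    return results[-1], trace


kg = SceneKG([
    Obj(1, "car", "moving", 10.0, 1.0),
    Obj(2, "car", "parked", 20.0, -2.0),
    Obj(3, "car", "moving", -5.0, 0.0),
    Obj(4, "pedestrian", "moving", 8.0, 3.0),
])

# Hand-written plan standing in for the LLM planner's output.
plan = [
    ("Resolve", ["car"]),
    ("RelSelect", ["front"]),
    ("Intersect", ["$0", "$1"]),
    ("Count", ["$2"]),
]

answer, trace = run_plan(kg, plan)
print(answer)
for step in trace:
    print(step)
```

Because every step's inputs and outputs are logged, the final count can be traced back to specific object ids in the graph, which is the kind of auditability the review attributes to the approach. This sketch runs a fixed plan once; an actual Plan–Execute–Observe loop would presumably let the planner revise later steps after observing earlier results.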

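Experimental Results

Research Questions

RQ1: How can reliable, fact-grounded reasoning over fine-grained 3D driving scenes be achieved from noisy multimodal data?
RQ2: Can a KG-augmented LLM equipped with constrained tools and energy-based refinement reduce hallucinations and improve interpretability in autonomous-driving QA?
RQ3: How well does KLDrive perform on large-scale driving QA benchmarks (NuScenes-QA and GVQA) without heavy task-specific fine-tuning?
RQ4: How do accurate scene fact construction and constrained reasoning affect challenging factual tasks such as counting?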

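Key Findings

KLDrive achieves 65.04% overall accuracy on NuScenes-QA, surpassing the strongest baseline at 60.17%.
KLDrive achieves the best SPICE score on GVQA, 42.45.
With fully correct perception facts, KLDrive's overall accuracy reaches 84.49%.
On counting, the most challenging factual reasoning task, KLDrive achieves 64.46% accuracy, 46.01 percentage points above the strongest baseline.
Energy-based refinement combined with fact-grounded, tool-driven LLM reasoning substantially reduces hallucinations compared with end-to-end methods.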
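
The margins implied by these figures can be checked with a few lines of arithmetic. The short Python sketch below uses only the numbers quoted in the findings. The derived values are not reported figures: they assume that the 46.01-point counting gain is an absolute difference against the strongest baseline on that task, and that the 84.49% result is measured on the same NuScenes-QA overall metric as the 65.04% result.

```python
# Figures quoted in the findings above.
kldrive_overall = 65.04      # NuScenes-QA overall accuracy (%)
baseline_overall = 60.17     # strongest baseline, overall accuracy (%)
oracle_overall = 84.49       # KLDrive with fully correct perception facts (%)
kldrive_counting = 64.46     # KLDrive counting accuracy (%)
counting_gain_pp = 46.01     # reported gain on counting (percentage points)

# Derived values (assumptions as stated in the lead-in).
overall_gain_pp = round(kldrive_overall - baseline_overall, 2)
implied_baseline_counting = round(kldrive_counting - counting_gain_pp, 2)
perception_gap_pp = round(oracle_overall - kldrive_overall, 2)

print(f"Overall gain over strongest baseline: {overall_gain_pp} pp")
print(f"Implied baseline counting accuracy:   {implied_baseline_counting}%")
print(f"Gap to perfect-perception setting:    {perception_gap_pp} pp")
```

Read this way, the overall gain is 4.87 points, the counting baseline would sit near 18.45%, and 19.45 points of overall accuracy remain attributable to perception errors and not to the reasoning stage.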

This review was generated by AI and checked by a human editor.