QUICK REVIEW

[Paper Review] KLDrive: Fine-Grained 3D Scene Reasoning for Autonomous Driving based on Knowledge Graph

Ye Tian, Jingyi Zhang | arXiv (Cornell University) | Mar 22, 2026

Multimodal Machine Learning Applications | Cited by 0

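One-Sentence Summary

KLDrive combines energy-based scene fact construction with a constrained Plan–Execute–Observe LLM agent to perform fine-grained 3D driving-scene QA, achieving state-of-the-art results on NuScenes-QA and GVQA.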

ABSTRACT

Autonomous driving requires reliable reasoning over fine-grained 3D scene facts. Fine-grained question answering over multi-modal driving observations provides a natural way to evaluate this capability, yet existing perception pipelines and driving-oriented large language model (LLM) methods still suffer from unreliable scene facts, hallucinations, opaque reasoning, and heavy reliance on task-specific training. We present KLDrive, the first knowledge-graph-augmented LLM reasoning framework for fine-grained question answering in autonomous driving. KLDrive addresses this problem through designing two tightly coupled components: an energy-based scene fact construction module that consolidates multi-source evidence into a reliable scene knowledge graph, and an LLM agent that performs fact-grounded reasoning over a constrained action space under explicit structural constraints. By combining structured prompting with few-shot in-context exemplars, the framework adapts to diverse reasoning tasks without heavy task-specific fine-tuning. Experiments on two large-scale autonomous-driving QA benchmarks show that KLDrive outperforms prior state-of-the-art methods, achieving the best overall accuracy of 65.04% on NuScenes-QA and the best SPICE score of 42.45 on GVQA. On counting, the most challenging factual reasoning task, it improves over the strongest baseline by 46.01 percentage points, demonstrating substantially reduced hallucinations and the benefit of coupling reliable scene fact construction with explicit reasoning.

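Research Motivation and Goals

Motivate the need for reliable reasoning over fine-grained 3D driving scene facts (object identity, motion, spatial relations) in support of driving decisions.
Develop a knowledge-graph-augmented framework that provides interpretable, fact-grounded reasoning without heavy task-specific fine-tuning.
Adapt robustly to diverse reasoning tasks through few-shot in-context learning and a constrained action space.
Mitigate hallucinations by grounding LLM reasoning in a structured scene KG and explicit tool use.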

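Proposed Method

Introduce a two-stage KLDrive pipeline: (i) energy-based scene fact construction, which builds a reliable scene knowledge graph (KG) from multi-source evidence; (ii) an LLM agent that reasons over the KG in a Plan–Execute–Observe loop within a constrained action space.
Aggregate multi-source evidence from camera and LiDAR detectors (RayDN, FocalFormer3D, IS-Fusion), applying cross-source pooling and temporal recovery to form a unified set of candidate scene entities.
Refine the candidates with an energy-based model that jointly accounts for retention, pairwise interactions, attributes, and temporal/contextual support, producing a consistent scene KG.
Build a compact library of relation operators that encodes inter-object relations in the KG without explicitly materializing every pair.
Use an LLM planner with in-context learning to decompose each question into executable operations in a constrained scene query algebra (Resolve, RelSelect, Intersect, Count, Exists, GetType, GetStatus, SameStatus).
Run the LLM in a Plan–Execute–Observe loop to obtain an auditable reasoning trace grounded in scene facts.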
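To make the constrained query algebra more concrete, here is a minimal Python sketch of how a plan over the eight named operators might be executed against a toy scene KG. Only the operator names come from the paper summary. The `Entity` fields, the ego-relative `RELATIONS` table, the `"$k"` step-reference convention, the operator semantics, and the hand-written `PLAN` (standing in for the LLM planner's output) are all assumptions made for illustration, not the authors' implementation. The plan is also fixed up front here, whereas an actual agent would presumably read each observation before choosing its next step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One node of a toy scene KG. All field names are illustrative assumptions."""
    id: int
    category: str   # e.g. "car", "pedestrian"
    status: str     # e.g. "moving", "parked"
    x: float        # lateral offset from ego in metres (assumed: + is right)
    y: float        # longitudinal offset from ego in metres (assumed: + is ahead)


# Relations are evaluated on demand from geometry instead of storing every pair.
RELATIONS = {
    "front": lambda e: e.y > 0,
    "back": lambda e: e.y < 0,
    "left": lambda e: e.x < 0,
    "right": lambda e: e.x > 0,
}


def _entities(scene, ids):
    """Return the entities whose id is in `ids`."""
    return [e for e in scene if e.id in ids]


# The only actions the agent may take. Semantics are assumed, not the paper's.
OPS = {
    "Resolve": lambda scene, category, status=None: {
        e.id for e in scene
        if e.category == category and (status is None or e.status == status)
    },
    "RelSelect": lambda scene, relation: {
        e.id for e in scene if RELATIONS[relation](e)
    },
    "Intersect": lambda scene, a, b: a & b,
    "Count": lambda scene, ids: len(ids),
    "Exists": lambda scene, ids: len(ids) > 0,
    "GetType": lambda scene, ids: {e.category for e in _entities(scene, ids)},
    "GetStatus": lambda scene, ids: {e.status for e in _entities(scene, ids)},
    "SameStatus": lambda scene, a, b: (
        len({e.status for e in _entities(scene, a | b)}) == 1
    ),
}


def execute_plan(scene, plan):
    """Execute each planned step and record its observation in an auditable trace."""
    results, trace = [], []
    for op, *args in plan:
        if op not in OPS:
            raise ValueError(f"operator {op!r} is outside the allowed action space")
        # A string argument of the form "$k" refers to the result of step k.
        resolved = [
            results[int(a[1:])] if isinstance(a, str) and a.startswith("$") else a
            for a in args
        ]
        observation = OPS[op](scene, *resolved)
        results.append(observation)
        trace.append((op, observation))
    return results[-1], trace


SCENE = [
    Entity(1, "car", "moving", -2.0, 10.0),
    Entity(2, "car", "parked", 3.0, 5.0),
    Entity(3, "pedestrian", "moving", 4.0, 8.0),
    Entity(4, "truck", "parked", 0.0, -6.0),
    Entity(5, "car", "moving", 1.0, 20.0),
]

# Hand-written plan for "How many moving cars are in front of the ego vehicle?"
PLAN = [
    ("Resolve", "car", "moving"),
    ("RelSelect", "front"),
    ("Intersect", "$0", "$1"),
    ("Count", "$2"),
]

answer, trace = execute_plan(SCENE, PLAN)
for op, observation in trace:
    print(op, "->", observation)
print("answer:", answer)
```

Because every step must be a key of `OPS`, a plan that names any other action is rejected before it touches the scene, which is one simple way to enforce a constrained action space. The recorded `trace` pairs each operator with what it returned, so a wrong count can be traced back to the step (or the scene fact) that produced it.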

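Experimental Results

Research Questions

RQ1: How can reliable, fact-grounded reasoning be achieved over fine-grained 3D driving scenes from noisy multi-modal data?
RQ2: Does a KG-augmented LLM with a restricted tool set and energy-based refinement reduce hallucinations and improve the interpretability of autonomous-driving QA?
RQ3: How does KLDrive perform on large-scale driving QA benchmarks such as NuScenes-QA and GVQA without heavy task-specific fine-tuning?
RQ4: How do accurate scene fact construction and constrained reasoning affect counting and other challenging factual tasks?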

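Key Findings

KLDrive reaches 65.04% overall accuracy on NuScenes-QA, exceeding the strongest baseline at 60.17%.
KLDrive obtains the best SPICE score on GVQA, 42.45.
When the perceived scene facts are fully correct, KLDrive's overall accuracy reaches 84.49%.
On counting, the most challenging factual reasoning task, KLDrive reaches 64.46% accuracy, an improvement of 46.01 percentage points over the strongest baseline.
Compared with end-to-end methods, combining energy-based refinement with fact-grounded, tool-driven LLM reasoning significantly reduces hallucinations.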
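The reported figures can be related to one another with a little arithmetic. The sketch below uses only numbers quoted in this review. The variable names are illustrative, and the derived values rest on two assumptions: that the 64.46% counting accuracy and the 46.01-point gain refer to the same benchmark and the same baseline, and that the 84.49% perfect-perception figure is measured on the same NuScenes-QA split as the 65.04% overall accuracy.

```python
# Figures quoted in this review, all in percent.
kldrive_overall = 65.04        # overall accuracy on NuScenes-QA
baseline_overall = 60.17       # strongest baseline, overall
kldrive_counting = 64.46       # counting accuracy
counting_gain_pp = 46.01       # reported gain over the strongest baseline
oracle_overall = 84.49         # overall accuracy with fully correct scene facts

# Overall improvement in percentage points.
overall_gain_pp = round(kldrive_overall - baseline_overall, 2)

# Baseline counting accuracy implied by the reported gain (assumption: same setup).
implied_baseline_counting = round(kldrive_counting - counting_gain_pp, 2)

# Accuracy still lost to imperfect perception (assumption: same split).
perception_gap_pp = round(oracle_overall - kldrive_overall, 2)

print(overall_gain_pp, implied_baseline_counting, perception_gap_pp)
```

Under those assumptions the overall gain is 4.87 points, the implied baseline counting accuracy is 18.45%, and 19.45 points of overall accuracy remain attributable to errors in the perceived scene facts rather than to the reasoning stage.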


This review was generated by AI and reviewed by human editors.