QUICK REVIEW

[Paper Review] PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning

Yupeng Zheng, Zebin Xing|arXiv (Cornell University)|Jun 3, 2024

Natural Language Processing Techniques | Cited by 5

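One-Sentence Summary

PlanAgent uses a multi-modal large language model (MLLM) with three modules (Environment Transformation, Reasoning Engine, and Reflection) for closed-loop, mid-to-mid autonomous driving planning; it achieves state-of-the-art results on nuPlan Val14 and generalizes well on Test14-hard.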

ABSTRACT

Vehicle motion planning is an essential component of autonomous driving technology. Current rule-based vehicle motion planning methods perform satisfactorily in common scenarios but struggle to generalize to long-tailed situations. Meanwhile, learning-based methods have yet to achieve superior performance over rule-based approaches in large-scale closed-loop scenarios. To address these issues, we propose PlanAgent, the first mid-to-mid planning system based on a Multi-modal Large Language Model (MLLM). MLLM is used as a cognitive agent to introduce human-like knowledge, interpretability, and common-sense reasoning into the closed-loop planning. Specifically, PlanAgent leverages the power of MLLM through three core modules. First, an Environment Transformation module constructs a Bird's Eye View (BEV) map and a lane-graph-based textual description from the environment as inputs. Second, a Reasoning Engine module introduces a hierarchical chain-of-thought from scene understanding to lateral and longitudinal motion instructions, culminating in planner code generation. Last, a Reflection module is integrated to simulate and evaluate the generated planner for reducing MLLM's uncertainty. PlanAgent is endowed with the common-sense reasoning and generalization capability of MLLM, which empowers it to effectively tackle both common and complex long-tailed scenarios. Our proposed PlanAgent is evaluated on the large-scale and challenging nuPlan benchmarks. A comprehensive set of experiments convincingly demonstrates that PlanAgent outperforms the existing state-of-the-art in the closed-loop motion planning task. Codes will be soon released.

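Motivation and Objectives

Bridge the gap between rule-based and learning-based planning by exploiting the common-sense reasoning of an MLLM in closed-loop autonomous driving.
Introduce an efficient Environment Transformation that converts complex scenes into a Bird's Eye View (BEV) map and a lane-graph-based textual description.
Introduce a Reasoning Engine with a hierarchical chain-of-thought that generates scenario-specific, IDM-based planner code.
Add a Reflection module that simulates and scores the generated planner to limit MLLM uncertainty and improve safety.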

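Proposed Method

The Environment Transformation module builds a BEV map (global) and a lane-graph-based textual description (local), which together form the multi-modal prompt.
The Reasoning Engine combines a predefined system prompt with a hierarchical chain-of-thought (CoT) and generates IDM-based planner code through in-context learning.
The generated planner code calls IDM with the parameters c, la, v0, acc, and dec; the Reflection module then executes it.
The Reflection module simulates the generated planner and scores its safety and efficiency; if the score falls below the threshold lambda (0.75), it triggers a rethink, with at most max_exec = 3 executions.
By describing the scene with a lane-graph-based text, PlanAgent is more token-efficient and uses fewer prompt tokens than competing LLM-based methods.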
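The generate, simulate, score, and rethink cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. The acceleration law is the textbook Intelligent Driver Model. Mapping v0, acc, and dec to desired speed, maximum acceleration, and comfortable deceleration is an assumption, and the parameters c and la are not modeled. The names IDMParams, idm_acceleration, plan_with_reflection, min_gap, headway, and delta are hypothetical, as is the choice to treat max_exec as the total number of attempts and to return the best-scoring attempt when none reaches the threshold.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class IDMParams:
    v0: float              # desired speed in m/s (interpretation assumed)
    acc: float             # maximum acceleration in m/s^2 (interpretation assumed)
    dec: float             # comfortable deceleration in m/s^2 (interpretation assumed)
    min_gap: float = 2.0   # assumption: standstill gap in m
    headway: float = 1.5   # assumption: desired time headway in s
    delta: float = 4.0     # assumption: the usual IDM exponent


def idm_acceleration(p: IDMParams, v: float, gap: float, v_lead: float) -> float:
    """Textbook IDM acceleration for ego speed v and a leader `gap` metres ahead."""
    dv = v - v_lead  # closing speed, positive when approaching the leader
    desired_gap = p.min_gap + max(
        0.0, v * p.headway + v * dv / (2.0 * math.sqrt(p.acc * p.dec))
    )
    return p.acc * (1.0 - (v / p.v0) ** p.delta - (desired_gap / gap) ** 2)


def plan_with_reflection(
    generate: Callable[[Optional[str]], IDMParams],
    simulate_and_score: Callable[[IDMParams], float],
    threshold: float = 0.75,
    max_exec: int = 3,
) -> Tuple[Optional[IDMParams], float, int]:
    """Generate, simulate, and score a planner; rethink while the score is too low."""
    best_params, best_score = None, float("-inf")
    feedback = None  # the first attempt has no feedback from a previous simulation
    attempts = 0
    for attempts in range(1, max_exec + 1):
        params = generate(feedback)          # stand-in for the MLLM reasoning step
        score = simulate_and_score(params)   # stand-in for the simulator and scorer
        if score > best_score:
            best_params, best_score = params, score
        if score >= threshold:
            break                            # good enough, stop rethinking
        feedback = f"score {score:.2f} is below threshold {threshold:.2f}; rethink"
    return best_params, best_score, attempts
```

In practice, generate would wrap the MLLM call and simulate_and_score would wrap a closed-loop rollout; both are left as plain callables here so that the control flow can be exercised with stubs.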

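Experimental Results

Research Questions

RQ1: Can an MLLM-based agent achieve robust closed-loop motion planning in both common and long-tailed driving scenarios?
RQ2: Can Environment Transformation combined with hierarchical reasoning produce reliable planner code for IDM-based planning?
RQ3: Can reflection-based safety checks improve planning reliability under MLLM uncertainty?
RQ4: How does PlanAgent compare with state-of-the-art rule-based, learning-based, and LLM-based planners on the nuPlan Val14 and Test14-hard benchmarks?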

Key Findings

Type	Planner	Val14 NR-CLS	Val14 R-CLS	Test14-hard NR-CLS	Test14-hard R-CLS
Expert	Log-replay	94.03	75.86	85.96	68.80
Rule-based	IDM [16]	70.39	72.42	56.16	62.26
Rule-based	PDM-Closed [2]	92.51	91.79	65.07	75.18
Learning-based	RasterModel [1]	69.66	67.54	49.47	52.16
Learning-based	UrbanDriver [44]	63.27	61.02	51.54	49.07
Learning-based	GC-PGP [45]	55.99	51.39	43.22	39.63
Learning-based	PDM-Open [2]	52.80	57.23	33.51	35.83
Learning-based	GameFormer [46]	80.80	79.31	66.59	68.83
Learning-based	PlanTF [3]	84.83	76.78	72.68	61.70
Learning-based	DTPP [4]	89.64	89.78	59.44	62.94
LLM-based	LLM-ASSIST UNC * [33]	90.11	90.32	-	-
LLM-based	LLM-ASSIST PAR * [33]	93.05	92.20	-	-
LLM-based	PlanAgent (Ours)	93.26	92.75	72.51	76.82

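PlanAgent achieves state-of-the-art NR-CLS and R-CLS on nuPlan Val14 (NR-CLS 93.26, R-CLS 92.75).
PlanAgent generalizes to long-tailed scenarios, reaching NR-CLS 72.51 and R-CLS 76.82 on Test14-hard.
In terms of token usage, PlanAgent's scene description averages 141.32 tokens, fewer than GPT-Driver (448.66) and LLM-ASSIST (425.81).
The ablation study shows that adding the BEV map and the lane graph improves NR-CLS by 1.5–2.0 points and R-CLS by 1.2–1.6 points.
Removing the Reflection module lowers NR-CLS noticeably (by up to 2.67 points), which shows the safety benefit of the simulate-and-score step.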
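To put the token figures above in relative terms, the short calculation below derives the fractional reduction from the reported averages. It is plain arithmetic on the numbers quoted in this review; the helper name token_reduction is hypothetical.

```python
# Average scene-description token counts as reported in this review.
PLANAGENT_TOKENS = 141.32
BASELINE_TOKENS = {"GPT-Driver": 448.66, "LLM-ASSIST": 425.81}


def token_reduction(baseline_tokens: float, own_tokens: float = PLANAGENT_TOKENS) -> float:
    """Fraction of tokens saved relative to a baseline description."""
    return 1.0 - own_tokens / baseline_tokens


for name, tokens in BASELINE_TOKENS.items():
    print(f"{name}: {token_reduction(tokens):.1%} fewer tokens")
```

With these inputs the reduction works out to roughly 68.5% relative to GPT-Driver and 66.8% relative to LLM-ASSIST.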

This review was generated by AI and reviewed by a human editor.