QUICK REVIEW

[Paper Review] LAD-Drive: Bridging Language and Trajectory with Action-Aware Diffusion Transformers

Fabian Schmidt, Karol Fedurko|arXiv (Cornell University)|Mar 2, 2026

Autonomous Vehicle Technology and Safety | Cited by 0

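One-Sentence Summary

LAD-Drive decouples language-driven high-level intent from low-level motion planning by pairing an action decoder with an action-aware diffusion decoder, and achieves state-of-the-art results on the LangAuto benchmark under multimodal conditioning.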

ABSTRACT

While multimodal large language models (MLLMs) provide advanced reasoning for autonomous driving, translating their discrete semantic knowledge into continuous trajectories remains a fundamental challenge. Existing methods often rely on unimodal planning heads that inherently limit their ability to represent multimodal driving behavior. Furthermore, most generative approaches frequently condition on one-hot encoded actions, discarding the nuanced navigational uncertainty critical for complex scenarios. To resolve these limitations, we introduce LAD-Drive, a generative framework that structurally disentangles high-level intention from low-level spatial planning. LAD-Drive employs an action decoder to infer a probabilistic meta-action distribution, establishing an explicit belief state that preserves the nuanced intent typically lost by one-hot encodings. This distribution, fused with the vehicle's kinematic state, conditions an action-aware diffusion decoder that utilizes a truncated denoising process to refine learned motion anchors into safe, kinematically feasible trajectories. Extensive evaluations on the LangAuto benchmark demonstrate that LAD-Drive achieves state-of-the-art results, outperforming competitive baselines by up to 59% in Driving Score while significantly reducing route deviations and collisions. We will publicly release the code and models on https://github.com/iis-esslingen/lad-drive.

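Motivation and Objectives

Advance multimodal autonomous driving that integrates language reasoning with trajectory planning.
Close the modality gap between discrete language tokens and continuous motion.
Prevent mode averaging by explicitly modeling a probabilistic distribution over high-level actions.
Constrain and align diffusion-based trajectory generation with both semantic intent and vehicle kinematics.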
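
Mode averaging can be illustrated with a toy example. The sketch below assumes two equally valid maneuvers at a junction, a left turn and a right turn, with waypoints invented for this illustration rather than taken from the paper. It shows that the waypoint-wise mean, which is what a unimodal regressor trained with a squared-error loss is pulled toward, drives straight ahead and matches neither maneuver, whereas selecting by an explicit action probability keeps a valid mode. The function names are hypothetical.

```python
# Toy illustration of mode averaging; all waypoints are invented for this example.
# Each trajectory is a list of (x, y) waypoints: x is lateral offset, y is forward distance.
left_turn = [(0.0, 1.0), (-0.5, 2.0), (-1.5, 2.5), (-2.5, 2.5)]
right_turn = [(0.0, 1.0), (0.5, 2.0), (1.5, 2.5), (2.5, 2.5)]


def mean_trajectory(trajectories):
    """Waypoint-wise mean, i.e. the minimizer of the summed squared error."""
    n = len(trajectories)
    return [
        (sum(t[i][0] for t in trajectories) / n, sum(t[i][1] for t in trajectories) / n)
        for i in range(len(trajectories[0]))
    ]


def pick_mode(trajectories, probabilities):
    """Select the trajectory of the most probable maneuver instead of averaging."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return trajectories[best]


averaged = mean_trajectory([left_turn, right_turn])
chosen = pick_mode([left_turn, right_turn], [0.7, 0.3])

print(averaged)  # every lateral offset is 0.0: the average goes straight
print(chosen)    # the left turn, kept intact
```

The averaged path has zero lateral offset at every waypoint, so it would head into the junction without turning. This is the failure that an explicit distribution over high-level actions is meant to avoid.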

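Proposed Method

Introduce an explicit probabilistic meta-action (belief) state, learned by an action decoder from context derived by the large language model (LLM).
Condition the diffusion-based trajectory decoder on the belief state and the ego-vehicle state, combined into a state-intent representation.
Initialize a truncated diffusion process with two denoising steps from domain-aligned motion anchors learned by k-means clustering.
Apply a feature bottleneck that maps high-dimensional LLM embeddings into a compact latent space, giving the diffusion decoder a stable basis.
Train in two stages: spatial alignment first, to learn feasible paths, then semantic alignment, to tie actions to trajectories.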
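
The data flow described in the steps above can be sketched in a few lines of standard-library Python. This is a toy under stated assumptions, not the authors' implementation: the meta-action set, the hand-written anchors (standing in for k-means centroids), the concatenation used to fuse belief and ego state, and the placeholder `toy_denoiser` are all hypothetical. What it does show is the shape of the idea: logits become a probabilistic belief rather than a one-hot choice, and refinement starts from a slightly perturbed anchor and runs for only two steps instead of a full chain from pure noise.

```python
import math
import random

META_ACTIONS = ["left", "straight", "right"]  # hypothetical meta-action set

# Stand-ins for learned motion anchors (the summary obtains them via k-means).
# One anchor per meta-action is an assumption made to keep the toy small.
ANCHORS = {
    "left": [(-0.5, 2.0), (-1.5, 3.5), (-3.0, 4.0)],
    "straight": [(0.0, 2.0), (0.0, 4.0), (0.0, 6.0)],
    "right": [(0.5, 2.0), (1.5, 3.5), (3.0, 4.0)],
}


def softmax(logits):
    """Turn action-decoder logits into a probabilistic belief state."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def add_noise(anchor, scale, rng):
    """Perturb an anchor slightly instead of sampling from pure noise."""
    return [(x + rng.gauss(0.0, scale), y + rng.gauss(0.0, scale)) for x, y in anchor]


def toy_denoiser(sample, anchor, condition):
    """Placeholder for the learned decoder's clean-trajectory prediction.

    A trained model would use `condition`; this toy ignores it and predicts the anchor.
    """
    return anchor


def denoise(sample, anchor, condition, steps=2):
    """Truncated refinement: each step moves halfway toward the prediction."""
    for _ in range(steps):
        predicted = toy_denoiser(sample, anchor, condition)
        sample = [
            (sx + 0.5 * (px - sx), sy + 0.5 * (py - sy))
            for (sx, sy), (px, py) in zip(sample, predicted)
        ]
    return sample


def plan(logits, ego_state, rng, noise_scale=0.3):
    """Return the belief, the most probable meta-action, and its refined trajectory."""
    belief = softmax(logits)
    condition = belief + list(ego_state)  # fusion by concatenation (assumed)
    candidates = {
        action: denoise(add_noise(ANCHORS[action], noise_scale, rng), ANCHORS[action], condition)
        for action in META_ACTIONS
    }
    best = META_ACTIONS[max(range(len(belief)), key=lambda i: belief[i])]
    return belief, best, candidates[best]


belief, action, trajectory = plan([2.0, 0.5, 0.1], ego_state=[5.0], rng=random.Random(0))
print([round(p, 2) for p in belief], action)
```

In this toy, two half-steps shrink the injected perturbation to a quarter of its original size. The belief keeps the probabilities of the non-selected actions, which a one-hot encoding would discard.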

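Experimental Results

Research Questions

RQ1: How can semantic intent be reliably decoupled from low-level trajectory execution, given language and perception inputs?
RQ2: Does an action-aware diffusion model conditioned on probabilistic meta-actions outperform multimodal trajectory planning with one-hot or implicit conditioning?
RQ3: How do domain-aligned motion priors and the semantic feature bottleneck affect planning stability and safety?
RQ4: Does explicit state-intent conditioning improve motion feasibility and instruction following in the LangAuto setting?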

Key Findings

Method	DS (Tiny)	RC (Tiny)	IS (Tiny)	DS (Short)	RC (Short)	IS (Short)	DS (Long)	RC (Long)	IS (Long)	DS (Mean)	RC (Mean)	IS (Mean)
LAD-Drive (Ours)	83.5	87.0	0.95	71.3	78.1	0.89	49.8	58.3	0.86	68.2	74.5	0.90
LMDrive (Reported)	66.5	77.9	0.85	50.6	60.0	0.84	36.2	46.6	0.81	51.1	61.5	0.83
LMDrive (Checkpoint)	60.7	71.0	0.82	41.3	57.0	0.79	26.8	36.5	0.77	42.9	54.8	0.79
AD-H	77.5	85.1	0.91	56.1	68.0	0.78	44.0	53.2	0.83	59.2	68.8	0.84
BEVDriver	70.2	81.3	0.87	66.7	77.8	0.87	48.9	59.7	0.82	61.9	72.9	0.85
SToRM	78.8	86.9	0.92	64.5	74.7	0.88	44.2	56.8	0.82	62.5	72.8	0.87
VLDrive	81.9	85.5	0.94	67.4	78.1	0.85	43.8	54.5	0.84	64.4	72.7	0.88
AdaDrive	80.9	87.6	0.90	70.6	85.3	0.81	42.9	53.4	0.82	64.8	75.4	0.84

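LAD-Drive achieves a state-of-the-art mean Driving Score (DS) of 68.2 on LangAuto. This is a 59% improvement over the LMDrive checkpoint baseline (mean DS 42.9); against the reported LMDrive score (51.1), the gain is about 33%.
Compared with the LMDrive checkpoint, LAD-Drive reduces route deviations (RD) from 11.95 to 2.31 and lowers infractions involving dynamic agents (CV from 2.83 to 0.67, CP from 0.08 to 0.02).
In the ablation, explicit lateral action conditioning combined with ego-state grounding performs best (DS 68.2, RC 74.5, IS 0.90).
The architecture substantially improves route completion (RC) and Driving Score over LMDrive while keeping latency competitive (47.04 ms) and reducing the decoder parameter count by 1.48M relative to the LMDrive baseline.
The two-stage training strategy (spatial alignment followed by semantic alignment) is essential for balancing physically feasible trajectories against semantic fidelity.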
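
The 59% figure depends on which LMDrive row serves as the baseline. The short calculation below uses the mean DS values from the results table and suggests that the figure corresponds to the LMDrive (Checkpoint) row. The helper name `relative_gain` is introduced here for illustration only.

```python
# Mean Driving Score (DS) values copied from the results table.
mean_ds = {
    "LAD-Drive": 68.2,
    "LMDrive (Reported)": 51.1,
    "LMDrive (Checkpoint)": 42.9,
}


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


for name in ("LMDrive (Reported)", "LMDrive (Checkpoint)"):
    gain = relative_gain(mean_ds["LAD-Drive"], mean_ds[name])
    print(f"{name}: {gain:.1f}%")
# LMDrive (Reported): 33.5%
# LMDrive (Checkpoint): 59.0%
```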


This review was generated by AI and reviewed by human editors.