QUICK REVIEW

[Paper Review] Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving

Shai Shalev‐Shwartz, Shaked Shammah|arXiv (Cornell University)|Oct 11, 2016

Reinforcement Learning in Robotics | References: 31 | Cited by: 367

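One-Sentence Summary

The paper proposes a safe reinforcement learning framework for autonomous driving that separates learned Desires from trajectory planning under hard constraints, uses an Option Graph for hierarchical temporal abstraction to reduce variance and sample complexity, and demonstrates the approach on a challenging double-merge scenario.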

ABSTRACT

Autonomous driving is a multi-agent setting where the host vehicle must apply sophisticated negotiation skills with other road users when overtaking, giving way, merging, taking left and right turns and while pushing ahead in unstructured urban roadways. Since there are many possible scenarios, manually tackling all possible cases will likely yield a too simplistic policy. Moreover, one must balance between unexpected behavior of other drivers/pedestrians and at the same time not to be too defensive so that normal traffic flow is maintained. In this paper we apply deep reinforcement learning to the problem of forming long term driving strategies. We note that there are two major challenges that make autonomous driving different from other robotic tasks. First, is the necessity for ensuring functional safety - something that machine learning has difficulty with given that performance is optimized at the level of an expectation over many instances. Second, the Markov Decision Process model often used in robotics is problematic in our case because of unpredictable behavior of other agents in this multi-agent scenario. We make three contributions in our work. First, we show how policy gradient iterations can be used without Markovian assumptions. Second, we decompose the problem into a composition of a Policy for Desires (which is to be learned) and trajectory planning with hard constraints (which is not learned). The goal of Desires is to enable comfort of driving, while hard constraints guarantees the safety of driving. Third, we introduce a hierarchical temporal abstraction we call an "Option Graph" with a gating mechanism that significantly reduces the effective horizon and thereby reducing the variance of the gradient estimation even further.
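
The first contribution named in the abstract, policy gradient without Markovian assumptions, can be illustrated with the standard likelihood-ratio identity. The derivation below is a sketch and is not copied from the paper: the symbols \tau (a trajectory of length T), R(\tau) (its total reward), and the history notation s_{1:t-1}, a_{1:t-1} are assumptions introduced here for illustration.

```latex
% Assumed notation: \tau = (s_1, a_1, \dots, s_T, a_T); for t = 1 the
% environment term is simply the initial-state distribution.
\begin{align*}
P_\theta(\tau) &= \prod_{t=1}^{T} P\bigl(s_t \mid s_{1:t-1}, a_{1:t-1}\bigr)\, \pi_\theta(a_t \mid s_t) \\
\nabla_\theta \log P_\theta(\tau) &= \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
\nabla_\theta \, \mathbb{E}_{\tau \sim P_\theta}\bigl[R(\tau)\bigr]
  &= \mathbb{E}_{\tau \sim P_\theta}\Bigl[ R(\tau) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Bigr]
\end{align*}
```

The environment terms may condition on the entire history, but they do not depend on \theta, so they drop out of the gradient of the log-probability. This is why a sampled estimate of the last line stays unbiased even when other road users behave in a non-Markovian way.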

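Research Motivation and Goals

Address the functional safety of learned policies in multi-agent traffic.
Handle non-Markovian, multi-agent dynamics by avoiding reliance on strict MDP assumptions.
Develop a learning framework that achieves comfortable driving while guaranteeing safety through hard constraints.
Introduce a hierarchical temporal abstraction to reduce gradient variance and sample complexity.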

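Proposed Method

Decompose the policy into a learned Desires policy and a non-learned trajectory planner with hard safety constraints.
Use policy gradient methods that do not require the Markov assumption, combined with variance reduction techniques.
Introduce an Option Graph that provides temporal abstraction and gating, reducing the effective horizon and the gradient variance.
Parameterize Desires as the product space [0, v_max] × L × {g,t,o}^n to capture speed, lane position, and interactions.
Translate Desires into a trajectory cost function subject to hard constraints in order to guarantee safety.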
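
The split between learned Desires and a non-learned, hard-constrained planner can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names, the quadratic cost, the single `min_gap_m` safety check, and the 2.0 m threshold are all hypothetical, and the labels g, t, o are read here as give way, take way, and keep an offset.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence, Tuple


class Interaction(Enum):
    # Assumed reading of the labels {g, t, o}; one label per other vehicle.
    GIVE_WAY = "g"
    TAKE_WAY = "t"
    OFFSET = "o"


@dataclass(frozen=True)
class Desires:
    """One point in the product space [0, v_max] x L x {g,t,o}^n."""
    target_speed: float                    # element of [0, v_max]
    lateral_position: float                # element of L (lane position)
    interactions: Tuple[Interaction, ...]  # carried along, unused in this toy cost


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical candidate trajectory, summarized by three numbers."""
    speed: float
    lateral_position: float
    min_gap_m: float  # smallest predicted gap to any other vehicle


def soft_cost(desires: Desires, traj: Trajectory) -> float:
    # Soft part: squared deviation from the desired speed and lane position.
    return ((traj.speed - desires.target_speed) ** 2
            + (traj.lateral_position - desires.lateral_position) ** 2)


def is_safe(traj: Trajectory, min_safe_gap_m: float) -> bool:
    # Hard part: a trajectory violating this is never eligible, whatever its cost.
    return traj.min_gap_m >= min_safe_gap_m


def plan(desires: Desires,
         candidates: Sequence[Trajectory],
         min_safe_gap_m: float = 2.0) -> Optional[Trajectory]:
    """Return the cheapest safe candidate, or None if no candidate is safe."""
    feasible = [t for t in candidates if is_safe(t, min_safe_gap_m)]
    if not feasible:
        return None
    return min(feasible, key=lambda t: soft_cost(desires, t))
```

In this sketch the learned component only influences `soft_cost`; the safety check is applied before any cost comparison, so a poorly trained Desires policy can make the ride less comfortable but cannot select an unsafe candidate.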

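Experimental Results

Research Questions

RQ1: Can policy gradient reinforcement learning still work effectively in multi-agent driving scenarios without the Markov assumption?
RQ2: How can functional safety be ensured in RL for autonomous driving without sacrificing learning efficiency?
RQ3: Does hierarchical temporal abstraction via the Option Graph reduce gradient variance and improve the sample efficiency of driving policies?
RQ4: Can the Desires-to-trajectory decomposition achieve safe, comfortable driving in complex merging scenarios?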

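Key Findings

Policy gradient can be formulated for autonomous driving without the Markov assumption; unbiased gradient estimates remain feasible.
Safety is achieved by decomposing the policy into Desires (learned) and a deterministic, constraint-driven trajectory planner.
The Option Graph provides hierarchical decision making that reduces the effective horizon and the gradient variance, thereby improving sample efficiency.
The Desires-to-trajectory framework enables driving with functional safety guarantees even in challenging maneuvers such as the double merge.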
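
The hierarchical decision making attributed to the Option Graph can be pictured as a walk from a root node down to a leaf, with each internal node choosing one child. The Python sketch below assumes that reading; the node names, observation keys, and thresholds are hypothetical, the per-node choices are hand-written rules rather than learned policies, and the gating mechanism is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Observation = Dict[str, float]


@dataclass
class OptionNode:
    name: str
    children: List["OptionNode"] = field(default_factory=list)
    # Hypothetical per-node rule: maps an observation to the index of a child.
    choose: Optional[Callable[[Observation], int]] = None


def traverse(root: OptionNode, observation: Observation) -> List[str]:
    """Walk from the root to a leaf and return the names of the visited nodes."""
    path = [root.name]
    node = root
    while node.children:
        if node.choose is None:
            raise ValueError(f"internal node {node.name!r} has no choice rule")
        node = node.children[node.choose(observation)]
        path.append(node.name)
    return path


def build_example_graph() -> OptionNode:
    # Leaves stand for concrete behaviors; names are illustrative only.
    merge_now = OptionNode("merge_now")
    wait_for_gap = OptionNode("wait_for_gap")
    keep_lane = OptionNode("keep_lane")
    prepare_merge = OptionNode(
        "prepare_merge",
        children=[merge_now, wait_for_gap],
        choose=lambda obs: 0 if obs["gap_left_m"] > 10.0 else 1,
    )
    return OptionNode(
        "root",
        children=[prepare_merge, keep_lane],
        choose=lambda obs: 0 if obs["distance_to_exit_m"] < 200.0 else 1,
    )
```

For example, `traverse(build_example_graph(), {"distance_to_exit_m": 100.0, "gap_left_m": 15.0})` visits `root`, `prepare_merge`, and `merge_now`. Each node only has to decide among its own children, which is one way to see how a hierarchy can shorten the sequence of decisions that a gradient estimate must account for.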

This review was generated by AI and reviewed by human editors.