QUICK REVIEW

[Paper Review] Deep Reinforcement Learning for Autonomous Driving

Sen Wang, Daoyuan Jia | arXiv (Cornell University) | Nov 28, 2018

Reinforcement Learning in Robotics | References: 16 | Cited by: 44

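One-Sentence Summary

This paper applies Deep Deterministic Policy Gradient (DDPG) to autonomous driving in the TORCS simulator, designing a custom set of sensor inputs and a custom reward function to handle the continuous action space and safety constraints.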

ABSTRACT

Reinforcement learning has steadily improved and has outperformed humans in many traditional games since the resurgence of deep neural networks. However, these successes are not easy to transfer to autonomous driving, because real-world state spaces are extremely complex, action spaces are continuous, and fine control is required. Moreover, autonomous vehicles must also maintain functional safety in complex environments. To deal with these challenges, we first adopt the deep deterministic policy gradient (DDPG) algorithm, which can handle complex state and action spaces in continuous domains. We then choose The Open Racing Car Simulator (TORCS) as our environment to avoid physical damage. Meanwhile, we select a set of appropriate sensor information from TORCS and design our own rewarder. To fit the DDPG algorithm to TORCS, we design our network architecture for both the actor and the critic within the DDPG paradigm. To demonstrate the effectiveness of our model, we evaluate it on different modes in TORCS and show both quantitative and qualitative results.

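Motivation and Objectives

Motivate and address the challenges of applying deep reinforcement learning to autonomous driving, where actions are continuous and states are complex.
Evaluate a DDPG-based agent in TORCS on learning a fast, safe driving policy.
Design sensor inputs and a custom reward function suited to TORCS and to continuous control.
Develop an actor-critic network architecture for the driving task within the DDPG framework.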

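Proposed Method

Use DDPG to learn a continuous control policy for steering, acceleration, and braking.
Select a 29-dimensional sensor input vector from TORCS as the state representation.
Define a reward function that favors speed along the track and penalizes both deviation from the track center and the velocity component perpendicular to the track.
Design actor and critic networks with a specific layer arrangement, together with an experience replay strategy.
Combine target networks with soft updates to stabilize learning.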
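
The reward shaping and the soft target update described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the argument names (`speed`, `angle`, `track_pos`), the penalty weights, and the default `tau` are all hypothetical, and the exact form of the paper's reward may differ. The soft update follows the standard DDPG rule, in which the target weights move a small fraction `tau` toward the online weights after each training step.

```python
import math


def driving_reward(speed, angle, track_pos, w_lateral=1.0, w_center=1.0):
    """Shaped reward: along-track speed minus two penalties.

    speed     -- car speed along its own heading (any consistent unit)
    angle     -- angle between the car heading and the track axis, in radians
    track_pos -- normalized offset from the track center (0 means centered)
    w_lateral, w_center -- hypothetical penalty weights, not values from the paper
    """
    along_track = speed * math.cos(angle)         # progress along the track (rewarded)
    perpendicular = abs(speed * math.sin(angle))  # motion across the track (penalized)
    off_center = abs(track_pos)                   # distance from the center (penalized)
    return along_track - w_lateral * perpendicular - w_center * off_center


def soft_update(target, online, tau=0.001):
    """Return new target weights: (1 - tau) * target + tau * online, element-wise.

    target, online -- equal-length sequences of floats standing in for network weights
    tau            -- assumed interpolation rate; small values change the target slowly
    """
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


if __name__ == "__main__":
    # A centered car driving straight scores higher than a skewed, off-center one.
    print(driving_reward(10.0, 0.0, 0.0))
    print(driving_reward(10.0, 0.3, 0.5))
    # One soft-update step toward the online weights.
    print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5))
```

In a training loop, `soft_update` would be applied to both the actor and the critic target networks after every gradient step, so that the targets used for the critic's bootstrapped value estimate change slowly.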

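Experimental Results

Research Questions

RQ1: Can DDPG learn an effective continuous control policy for autonomous driving in a simulator?
RQ2: How should the sensor inputs and the reward design be tailored to TORCS to facilitate learning?
RQ3: Which network architecture and stabilization techniques (for example, target networks and a replay buffer) improve learning efficiency on this task?
RQ4: How does the agent perform across TORCS modes (training versus competition) and across different driving scenarios?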

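Key Findings

The DDPG-based agent achieves fast driving in the TORCS simulator while maintaining functional safety in the training setting.
During training, average speed and step-gain rise over the episodes and stabilize after roughly 100 episodes.
The agent learns to slow down before corners, which reduces drifting and improves cornering.
Results in competition mode indicate that the agent can overtake opponents in turns and adapt to evolving situations.
Training also includes episodes in which the agent briefly stalls or drifts, highlighting stability issues caused by environmental factors.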


This review was generated by AI and checked by a human editor.