QUICK REVIEW

[Paper Review] Query-Efficient Imitation Learning for End-to-End Autonomous Driving

Jiakai Zhang, Kyunghyun Cho|arXiv (Cornell University)|May 20, 2016

Reinforcement Learning in Robotics|References: 15|Cited by: 115

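ONE-SENTENCE SUMMARY

SafeDAgger extends DAgger with a safety policy that reduces queries to the reference policy, giving more query-efficient imitation learning for end-to-end autonomous driving and faster, safer convergence in the TORCS simulator.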

ABSTRACT

One way to approach end-to-end autonomous driving is to learn a policy function that maps from a sensory input, such as an image frame from a front-facing camera, to a driving action, by imitating an expert driver, or a reference policy. This can be done by supervised learning, where a policy function is tuned to minimize the difference between the predicted and ground-truth actions. A policy function trained in this way however is known to suffer from unexpected behaviours due to the mismatch between the states reachable by the reference policy and trained policy functions. More advanced algorithms for imitation learning, such as DAgger, addresses this issue by iteratively collecting training examples from both reference and trained policies. These algorithms often requires a large number of queries to a reference policy, which is undesirable as the reference policy is often expensive. In this paper, we propose an extension of the DAgger, called SafeDAgger, that is query-efficient and more suitable for end-to-end autonomous driving. We evaluate the proposed SafeDAgger in a car racing simulator and show that it indeed requires less queries to a reference policy. We observe a significant speed up in convergence, which we conjecture to be due to the effect of automated curriculum learning.

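MOTIVATION AND GOALS

Advance end-to-end autonomous driving through imitation learning from a reference policy.
Address the high query cost of DAgger when the reference policy is expensive (for example, a human driver).
Propose SafeDAgger, a query-efficient extension of DAgger that uses a safety policy to minimize queries to the reference policy.
Demonstrate in the TORCS simulator that SafeDAgger speeds up convergence and reduces crashes and damage.
Highlight the automated curriculum learning effect that arises from subset selection guided by safety estimates.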

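PROPOSED METHOD

Introduce a safety policy that predicts, without querying the reference policy, when the primary policy is likely to deviate from it.
Define the deviation ε(π, π*, φ(s)) = ||π(φ(s)) − π*(φ(s))||^2 and a threshold τ, which together form the target safety policy π_safe*: a state is labeled safe (1) when the deviation does not exceed τ and unsafe (0) otherwise.
Integrate the safety policy into the SafeDAgger loop so that the reference policy is queried only on hard examples (those for which the safety policy returns 0).
Use subset selection to restrict which states are queried during data collection, yielding data efficiency and curriculum-like learning.
Retain a DAgger-like learning-to-search framework in which both the primary policy and the safety policy are updated across iterations.
Apply the method in TORCS with a deep CNN primary policy that predicts steering, braking, and affordances, and a safety policy that predicts whether a driving decision is safe or unsafe.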
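
The deviation, threshold, and subset-selection steps listed above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names (`deviation`, `safety_label`, `collect_queries`) and the toy policies in the usage note are hypothetical, and policies are assumed to be plain callables that map a state to an action.

```python
import numpy as np


def deviation(primary_action, reference_action):
    """Squared L2 distance between the primary and reference actions."""
    diff = np.asarray(primary_action, dtype=float) - np.asarray(reference_action, dtype=float)
    return float(np.sum(diff ** 2))


def safety_label(primary_action, reference_action, tau):
    """Training target for the safety policy: 1 = safe, 0 = unsafe.

    Assumption: a state counts as safe when the deviation is at most tau.
    """
    return 1 if deviation(primary_action, reference_action) <= tau else 0


def collect_queries(states, safety_policy, reference_policy):
    """Query the reference policy only on states flagged unsafe (label 0).

    Returns (state, reference_action) pairs to add to the training set.
    """
    hard_states = [s for s in states if safety_policy(s) == 0]
    return [(s, reference_policy(s)) for s in hard_states]
```

As a usage example, with a toy safety policy `lambda s: 0 if abs(s) > 1 else 1` and a toy reference policy `lambda s: -s`, calling `collect_queries([0.5, 2.0, -3.0], ...)` queries the reference policy only for `2.0` and `-3.0`. The state `0.5` is judged safe and costs no query, which is where the query savings come from.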

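EXPERIMENTAL RESULTS

Research Questions

RQ1: Does SafeDAgger reduce the number of queries to the reference policy in end-to-end driving compared with standard DAgger?
RQ2: In a simulated driving environment, does SafeDAgger converge faster and drive better (fewer crashes, less damage) than supervised learning or DAgger?
RQ3: Does the safety policy produce a meaningful automated curriculum that improves data efficiency and policy quality?
RQ4: How does SafeDAgger perform in TORCS with and without traffic?
RQ5: Can the safety policy concept be extended to imitation learning frameworks other than DAgger?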

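Key Findings

SafeDAgger issues significantly fewer queries to the reference policy during training than the original DAgger.
After three iterations, the policy trained with SafeDAgger achieves near-perfect driving in the TORCS setup.
At test time, the safety policy reduces the share of time the reference policy is used: in the early stage it is 7.11% without traffic and 10.81% with traffic.
In the reported setup, about 77.70% of the training examples are judged safe by the safety policy.
Compared with vanilla DAgger, SafeDAgger converges faster and shows a more pronounced downward trend in its reliance on the reference policy.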

This review was generated by AI and checked by a human editor.