QUICK REVIEW

[Paper Digest] Deep Imitative Models for Flexible Inference, Planning, and Control

Nicholas Rhinehart, Rowan McAllister|arXiv (Cornell University)|Oct 15, 2018

Reinforcement Learning in Robotics | References: 39 | Cited by: 55

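One-Sentence Summary

An imitative model learns a probabilistic model of expert trajectories and, at test time, uses a goal likelihood to plan expert-like trajectories that reach flexibly specified goals, combining imitation learning with planning without reward engineering.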

ABSTRACT

Imitation Learning (IL) is an appealing approach to learn desirable autonomous behavior. However, directing IL to achieve arbitrary goals is difficult. In contrast, planning-based algorithms use dynamics models and reward functions to achieve goals. Yet, reward functions that evoke desirable behavior are often difficult to specify. In this paper, we propose Imitative Models to combine the benefits of IL and goal-directed planning. Imitative Models are probabilistic predictive models of desirable behavior able to plan interpretable expert-like trajectories to achieve specified goals. We derive families of flexible goal objectives, including constrained goal regions, unconstrained goal sets, and energy-based goals. We show that our method can use these objectives to successfully direct behavior. Our method substantially outperforms six IL approaches and a planning-based approach in a dynamic simulated autonomous driving task, and is efficiently learned from expert demonstrations without online data collection. We also show our approach is robust to poorly specified goals, such as goals on the wrong side of the road.

Research Motivation and Objectives

Motivate the need for flexible goal-directed control beyond traditional imitation learning and reward-based planning.
Propose a probabilistic imitative model that forecasts expert trajectories conditioned on scene observations.
Develop a planning objective that combines the imitation prior with a test-time goal likelihood to produce expert-like plans.
Showcase robustness and flexibility across various goal specifications and test-time conditions in autonomous driving.

Proposed Method

Train an imitative model q(S_{1:T} | φ) to forecast expert trajectories from offline demonstrations.
Use a probabilistic trajectory density (R2P2-based autoregressive flow) to model expert-like behavior and enable gradient-based planning.
Formulate a maximum a posteriori planning objective: s* = argmax_s log q(s|φ) + log p(G|s,φ) - log p(G|φ). The last term does not depend on s, so it can be dropped during optimization.
Construct diverse goal likelihoods p(G|s,φ): Final-State Indicator (region/line/point constraints), Gaussian Final-State (single or multiple future states), Gaussian State Sequence, and Gaussian Final-State Mixture with optional test-time costs.
Instantiate the model for autonomous driving in CARLA using route waypoints as goals, LIDAR/camera inputs, and a PID low-level controller.
Employ an attention-augmented neural architecture (mθ, σθ) to parameterize q(S|φ) with inputs including past states, perception χ, traffic signal λ, and latent Z.
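
The planning objective above can be illustrated with a minimal sketch. It is not the paper's implementation: the learned R2P2 flow q(s|φ) is replaced by a hypothetical constant-velocity Gaussian prior, the goal likelihood is a Gaussian on the final state, and the plan is optimized by plain gradient ascent directly on the trajectory. All function names, parameters, and default values (`plan`, `velocity`, `sigma`, `eps`, `lr`, `steps`) are assumptions made for illustration.

```python
import numpy as np

def log_prior(traj, start, velocity, sigma):
    """Toy stand-in for log q(s | phi), up to a constant.

    Assumes each step follows s_t ~ N(s_{t-1} + velocity, sigma^2 I).
    The paper uses a learned autoregressive flow instead.
    """
    prev = np.vstack([start, traj[:-1]])
    resid = traj - prev - velocity
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

def log_goal(traj, goal, eps):
    """Gaussian final-state goal likelihood log p(G | s), up to a constant."""
    return -0.5 * np.sum((traj[-1] - goal) ** 2) / eps ** 2

def objective(traj, start, velocity, sigma, goal, eps):
    """log q(s | phi) + log p(G | s); the term log p(G | phi) is constant in s."""
    return log_prior(traj, start, velocity, sigma) + log_goal(traj, goal, eps)

def objective_grad(traj, start, velocity, sigma, goal, eps):
    """Analytic gradient of the objective with respect to every state s_t."""
    prev = np.vstack([start, traj[:-1]])
    resid = traj - prev - velocity           # r_t = s_t - s_{t-1} - v
    grad = -resid / sigma ** 2               # contribution of r_t to s_t
    grad[:-1] += resid[1:] / sigma ** 2      # contribution of r_{t+1} to s_t
    grad[-1] -= (traj[-1] - goal) / eps ** 2 # goal term acts on the final state
    return grad

def plan(start, goal, horizon=10, velocity=(1.0, 0.0),
         sigma=1.0, eps=1.0, lr=0.1, steps=5000):
    """Gradient-ascent MAP planning over a (horizon, 2) trajectory."""
    start = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    velocity = np.asarray(velocity, dtype=float)
    # Initialize at the mode of the toy prior (straight-line motion).
    traj = start + velocity * np.arange(1, horizon + 1)[:, None]
    for _ in range(steps):
        traj = traj + lr * objective_grad(traj, start, velocity, sigma, goal, eps)
    return traj

if __name__ == "__main__":
    trajectory = plan(start=(0.0, 0.0), goal=(10.0, 3.0))
    print(trajectory[-1])
```

In this toy setting the planned final state settles between the prior's preferred endpoint and the goal: the prior term pulls the plan toward typical behavior, and the goal term pulls it toward the requested location. That trade-off is the mechanism the summary credits for robustness to poorly specified goals.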

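Experimental Results

Research Questions

RQ1: Can an imitative model trained offline produce interpretable, expert-like multi-step plans without reward engineering?
RQ2: How flexibly can the method achieve goals at test time that were not seen during training (for example, region-based goals or pothole avoidance)?
RQ3: How robust is the method to noisy or misspecified goals and to decoy waypoints?
RQ4: Does the proposed method achieve state-of-the-art performance on a dynamic autonomous driving benchmark (CARLA) using standard sensor inputs?
RQ5: How does incorporating test-time costs or different forms of goal likelihood affect planning quality?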

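Key Findings

The method produces interpretable, expert-like multi-step plans in CARLA without reward engineering, and outperforms six imitation learning approaches and one planning-based baseline.
Imitative planning under different goal likelihoods performs well in both training and test conditions, including dynamic scenes.
The method is robust to noisy or misspecified goals, including goals on the wrong side of the road and decoy waypoints.
In the experiments, the method achieves state-of-the-art or competitive CARLA performance using common autonomous driving inputs (waypoints and LIDAR).
Test-time costs (such as pothole avoidance) can be incorporated to produce safe, goal-directed behavior that was not demonstrated during training.
By combining the imitation prior with a goal likelihood, the framework adapts to new tasks without retraining.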

This digest was generated by AI and reviewed by human editors.