QUICK REVIEW

[Paper Review] Model-Predictive Policy Learning with Uncertainty Regularization for Driving in Dense Traffic

Mikael Henaff, Alfredo Canziani | arXiv (Cornell University) | Jan 8, 2019

Gaussian Processes and Bayesian Inference | References: 40 | Cited by: 78

ONE-SENTENCE SUMMARY

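The paper learns driving policies from purely observational data by training a stochastic forward model and then a policy that minimizes a policy cost plus an uncertainty regularizer, which allows multi-step backpropagation through the learned dynamics with no environment interaction.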

ABSTRACT

Learning a policy using only observational data is challenging because the distribution of states it induces at execution time may differ from the distribution observed during training. We propose to train a policy by unrolling a learned model of the environment dynamics over multiple time steps while explicitly penalizing two costs: the original cost the policy seeks to optimize, and an uncertainty cost which represents its divergence from the states it is trained on. We measure this second cost by using the uncertainty of the dynamics model about its own predictions, using recent ideas from uncertainty estimation for deep networks. We evaluate our approach using a large-scale observational dataset of driving behavior recorded from traffic cameras, and show that we are able to learn effective driving policies from purely observational data, with no environment interaction.
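
The two-cost objective described in the abstract can be written compactly. The following is a sketch rather than the paper's exact formulation: C is the policy cost, U is the uncertainty cost, f_theta is the learned forward model, pi_psi is the policy, and T is the rollout horizon. The weighting coefficient lambda and the exact time indexing are assumptions made here for illustration.

```latex
\min_{\psi} \; \sum_{i=1}^{T} \Big[ C(\hat{s}_{t+i}) + \lambda \, U(\hat{s}_{t+i}) \Big]
\quad \text{subject to} \quad
\hat{a}_{t+i-1} = \pi_{\psi}(\hat{s}_{t+i-1}), \qquad
\hat{s}_{t+i} = f_{\theta}(\hat{s}_{t+i-1}, \hat{a}_{t+i-1}, z_{t+i-1})
```

Because every predicted state depends on the previous predicted state and action, the gradient with respect to psi flows through all T steps of the unrolled model.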

MOTIVATION AND OBJECTIVES

Motivate learning policies from observational driving data where environment interaction is costly or dangerous.
Propose a two-stage approach: learn an action-conditional forward model from data, then train a policy with backpropagation through the unrolled model.
Introduce an uncertainty cost based on model prediction uncertainty to discourage states far from training data.
Demonstrate that uncertainty-regularized model-based planning improves policy quality in dense-traffic driving.
Release dataset and environment to facilitate further research.

PROPOSED METHOD

Train an action-conditional forward model f_theta(s_1:t, a_t, z_t) with a VAE-style latent variable z_t and a posterior q_phi, using z-dropout so that the model stays responsive to actions.
Unroll the forward model over horizon T and backpropagate a differentiable loss combining a policy cost C and an uncertainty cost U, where U is the trace of the covariance of forward predictions under multiple dropout masks.
Estimate U with a dropout-based approximation: run K forward passes with different dropout masks, compute the covariance of the resulting predictions, and take its trace: U(s_hat_{t+1}) = tr(Cov[{f_theta_k(s_1:t, a_t, z_t)}_{k=1}^K]).
Optionally relate the forward model to Bayesian neural networks and describe how the posterior over latent variables and dropout weights approximates the true posterior.
Define two variants: MPUR (Model-Predictive Policy Learning with Uncertainty Regularization) and MPER (Model-Predictive Policy Learning with Expert Regularization).
Apply the learned dynamics to train a stochastic policy pi_psi by gradient-based optimization over rolled-out trajectories with gradients flowing through the unrolled model.
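
The dropout-based uncertainty cost above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the one-hidden-layer `toy_forward_model`, the `weights` dictionary, and the parameters `num_masks` and `keep_prob` are hypothetical stand-ins, and the latent variable z_t is omitted for brevity. The only part taken from the summary is the definition of U as the trace of the covariance of predictions across K dropout masks.

```python
import numpy as np


def toy_forward_model(state, action, weights, mask):
    """Hypothetical one-hidden-layer dynamics model (an assumption, not the paper's network).

    `mask` is a dropout mask applied to the hidden units.
    """
    hidden = np.tanh(weights["w_in"] @ np.concatenate([state, action]))
    return weights["w_out"] @ (hidden * mask)


def trace_of_covariance(predictions):
    """Trace of the sample covariance of a (K, state_dim) array of predictions.

    The trace equals the sum of the per-dimension sample variances,
    so the full covariance matrix never has to be formed.
    """
    return float(predictions.var(axis=0, ddof=1).sum())


def uncertainty_cost(state, action, weights, num_masks=10, keep_prob=0.9, rng=None):
    """Estimate U by predicting the next state under K = num_masks dropout masks."""
    rng = np.random.default_rng() if rng is None else rng
    hidden_dim = weights["w_in"].shape[0]
    predictions = []
    for _ in range(num_masks):
        # Inverted dropout: keep each hidden unit with probability keep_prob.
        mask = rng.binomial(1, keep_prob, size=hidden_dim) / keep_prob
        predictions.append(toy_forward_model(state, action, weights, mask))
    return trace_of_covariance(np.stack(predictions))
```

In a policy-training loop, this value would be added to the policy cost C with some weight at every step of the rollout. When all masks agree (for example, `keep_prob=1.0`), the predictions coincide and the cost is zero up to floating-point rounding; the more the predictions disagree, the larger the penalty.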

EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a policy be learned from purely observational driving data without environment interaction by penalizing divergence from training data trajectories?
RQ2: Does incorporating an uncertainty regularizer in the forward-model-based policy learning improve performance in dense-traffic driving?
RQ3: How does a modified posterior for latent variables (z-dropout) affect responsiveness to actions and policy performance in stochastic dynamics models?
RQ4: What is the impact of longer rollout horizons on matching the induced state distribution to the training manifold?

Key Findings

MPUR and MPER policies substantially outperform baselines learned from observational data, including imitation-learning and SVG/VG variants.
Including the uncertainty cost is essential; removing it (VG) yields high uncertainty and poorer real-environment performance.
Longer rollout horizons markedly improve policy performance across methods, with stochastic models and z-dropout providing the best gains.
The modified posterior with z-dropout improves responsiveness to actions and boosts policy success when using stochastic dynamics.
The MPUR approach can achieve near-human performance on dense traffic driving tasks within the observational-data regime.
Policy and environment results are supported by quantitative metrics and qualitative trajectory analysis.


This review was generated by AI and reviewed by human editors.