QUICK REVIEW

[Paper Review] Composing Meta-Policies for Autonomous Driving Using Hierarchical Deep Reinforcement Learning

Richard Liaw, Sanjay Krishnan|arXiv (Cornell University)|Nov 4, 2017

Reinforcement Learning in Robotics|23 references|17 citations

TL;DR

This paper proposes a hierarchical deep reinforcement learning approach that composes a meta-policy from pre-trained basis policies for autonomous driving in partially observed, noisy environments. A GRU-based meta-policy dynamically selects among fixed controllers. In a fully observed experiment, the method achieves 2.6x the reward of the next best composition technique with 80% less exploration. In a partially observed experiment, it converges in 50 iterations, whereas direct RL fails to converge even after 200 iterations.

ABSTRACT

Rather than learning new control policies for each new task, it is possible, when tasks share some structure, to compose a "meta-policy" from previously learned policies. This paper reports results from experiments using Deep Reinforcement Learning on a continuous-state, discrete-action autonomous driving simulator. We explore how Deep Neural Networks can represent meta-policies that switch among a set of previously learned policies, specifically in settings where the dynamics of a new scenario are composed of a mixture of previously learned dynamics and where the state observation is possibly corrupted by sensing noise. We also report the results of experiments varying dynamics mixes, distractor policies, magnitudes/distributions of sensing noise, and obstacles. In a fully observed experiment, the meta-policy learning algorithm achieves 2.6x the reward achieved by the next best policy composition technique with 80% less exploration. In a partially observed experiment, the meta-policy learning algorithm converges after 50 iterations while a direct application of RL fails to converge even after 200 iterations.

Motivation & Objective

To address the challenge of controlling autonomous vehicles with unknown or mixed dynamical regimes by composing existing policies rather than retraining from scratch.
To improve sample efficiency and convergence speed in reinforcement learning by leveraging previously trained policies as basis policies.
To handle partial observability due to sensing noise by using recurrent neural networks (GRUs) in the meta-policy to maintain memory of past observations.
To evaluate the robustness of meta-policy learning under varying dynamics mixes, distractor policies, and noise distributions in a simulated driving environment.
To compare meta-policy learning against direct RL and ensemble methods in terms of reward, convergence speed, and sample efficiency.

Proposed method

A meta-policy is learned using deep reinforcement learning, where the action space is discrete selection among k pre-trained basis policies (e.g., cruise control for new vs. old cars).
The meta-policy is represented by a Gated Recurrent Unit (GRU) to model temporal dependencies and handle partial observability by maintaining a memory of past states and observations.
The basis policies are fixed and pre-trained on known dynamical regimes (e.g., cars with different wear levels), and the meta-policy learns when to apply each based on current state observations.
Training uses a policy gradient method with a discount factor of 0.995, batch size of 1000–2000, and learning rate of 0.001 to optimize the meta-policy's selection strategy.
Experiments are conducted in a continuous-state, discrete-action driving simulator with varying dynamics mixes, sensing noise, and obstacle configurations.
The method is compared against direct RL, voting ensembles, confidence ensembles, and multi-armed bandit baselines to evaluate performance and sample efficiency.
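
The selection loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the meta-policy's action is an index into a list of fixed basis policies, and per-step returns are discounted with the reported factor of 0.995. The two basis policy functions, the scalar observation, and the logits are hypothetical placeholders; in the paper's setup the logits would be produced by the GRU.

```python
import math
import random

GAMMA = 0.995  # discount factor reported in the review


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discounted_returns(rewards, gamma=GAMMA):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for each step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def select_basis_policy(logits, rng):
    """Sample an index in [0, k) from the meta-policy's softmax distribution."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical fixed basis policies: each maps an observation to a discrete action.
def cruise_new_car(obs):
    return 1 if obs < 0.0 else 0


def cruise_old_car(obs):
    return 2 if obs < 0.0 else 0


basis_policies = [cruise_new_car, cruise_old_car]

rng = random.Random(0)
obs = -0.3            # made-up scalar observation
logits = [0.2, -0.1]  # assumption: stand-in for the GRU's output
choice = select_basis_policy(logits, rng)  # meta-action: which controller to run
action = basis_policies[choice](obs)       # low-level action from that controller
```

The basis policies are never updated here; a policy gradient step would only adjust whatever produces `logits`, weighting the log-probability of each sampled `choice` by the corresponding entry of `discounted_returns`.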

Experimental results

Research questions

RQ1: Can a meta-policy composed from pre-trained basis policies achieve higher sample efficiency and faster convergence than direct reinforcement learning in autonomous driving tasks with unknown dynamics?
RQ2: How does the meta-policy perform under partial observability due to sensing noise, and can recurrent networks improve performance compared to non-recurrent models?
RQ3: What is the impact of including irrelevant or suboptimal distractor policies on the meta-policy’s ability to converge and achieve high reward?
RQ4: How does the reward shaping (e.g., linear vs. quadratic distance penalty) affect the convergence rate of meta-policy learning versus direct RL?
RQ5: Can meta-policy learning outperform simple ensemble or bandit-based selection strategies in terms of both reward and exploration efficiency?

Key findings

In the fully observed setting, the meta-policy learning approach achieved 2.6x the reward of the next best policy composition technique with 80% less exploration.
In the partially observed setting, the meta-policy converged to a high-reward policy in approximately 50 iterations, while direct RL failed to converge even after 200 iterations.
The meta-policy reached a final reward of 87.90, outperforming both voting ensembling (31.92) and confidence ensembling (10.32); the direct RL baseline reached 89.16 after 500 iterations.
The use of GRUs in the meta-policy enabled effective handling of partial observability by maintaining memory of past observations, improving robustness to sensing noise.
The convergence rate of meta-policy learning improved with stronger reward shaping, suggesting it is most beneficial in sparse or delayed-reward environments.
A multi-armed bandit baseline (UCB) with 3 distractor policies identified the correct policy within 4,000 steps, two orders of magnitude faster than hierarchical RL, which suggests that hybrid initialization strategies may be worthwhile.
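
To illustrate the kind of bandit baseline mentioned in the last finding, here is a minimal UCB1 sketch. It treats each candidate policy as an arm and ignores state entirely, which is why such a baseline can settle quickly but cannot switch policies within an episode. The review does not say which UCB variant the paper uses, so UCB1 is an assumption, as are the four arms (one useful policy plus three distractors) and their Bernoulli reward probabilities, which are made up for illustration. Only the 4,000-step budget comes from the text.

```python
import math
import random


def ucb1_select(counts, sums, t):
    """Pick the arm with the highest UCB1 score; untried arms go first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def run_ucb1(reward_fns, steps, rng):
    """Run UCB1 for a fixed number of steps and return per-arm pull counts."""
    k = len(reward_fns)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, sums, t)
        counts[arm] += 1
        sums[arm] += reward_fns[arm](rng)
    return counts


# Hypothetical arms: index 0 is the useful policy, 1-3 are distractors.
means = [0.8, 0.3, 0.2, 0.1]  # made-up Bernoulli reward probabilities
arms = [lambda rng, p=p: 1.0 if rng.random() < p else 0.0 for p in means]

counts = run_ucb1(arms, steps=4000, rng=random.Random(0))
best_arm = max(range(len(counts)), key=counts.__getitem__)  # most-pulled arm
```

One way to read the hybrid initialization idea is to use the bandit's pull counts as a prior over basis policies before the recurrent meta-policy starts training; the review does not describe how the authors would do this.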


This review was created by AI and reviewed by human editors.