QUICK REVIEW

[Paper Review] Modular Multitask Reinforcement Learning with Policy Sketches

Jacob Andreas, Dan Klein, Sergey Levine | arXiv (Cornell University) | Nov 6, 2016
Reinforcement Learning in Robotics | 30 references | 229 citations
TL;DR

The paper introduces a modular multitask reinforcement learning framework guided by abstract policy sketches, learning reusable subpolicies per high-level symbol and optimizing via a decoupled actor–critic with curriculum learning.

ABSTRACT

We describe a framework for multitask deep reinforcement learning guided by policy sketches. Sketches annotate tasks with sequences of named subtasks, providing information about high-level structural relationships among tasks but not how to implement them, specifically not providing the detailed guidance used by much previous work on learning policy abstractions for RL (e.g. intermediate rewards, subtask completion signals, or intrinsic motivations). To learn from sketches, we present a model that associates every subtask with a modular subpolicy, and jointly maximizes reward over full task-specific policies by tying parameters across shared subpolicies. Optimization is accomplished via a decoupled actor–critic training objective that facilitates learning common behaviors from multiple dissimilar reward functions. We evaluate the effectiveness of our approach in three environments featuring both discrete and continuous control, and with sparse rewards that can be obtained only after completing a number of high-level subgoals. Experiments show that using our approach to learn policies guided by sketches gives better performance than existing techniques for learning task-specific or shared policies, while naturally inducing a library of interpretable primitive behaviors that can be recombined to rapidly adapt to new tasks.

Motivation & Objective

  • Learn hierarchical policies from sketches that only name subtasks, without specifying how each high-level action is grounded in the environment.
  • Present a modular subpolicy architecture that associates each high-level symbol with a reusable subpolicy.
  • Develop a decoupled actor–critic training objective suitable for modular, multi-task policies.
  • Demonstrate training with curriculum learning and assess generalization to zero-shot and adaptation settings.

Proposed method

  • Annotate tasks with sketches consisting of sequences of high-level symbols.
  • Associate each symbol with a dedicated subpolicy and share subpolicies across tasks using the same symbol.
  • Treat each task policy as a concatenation of the subpolicies named in its sketch; the active subpolicy emits a special stop action to hand control to the next one.
  • Use a decoupled actor–critic objective with a task- and state-dependent critic to reduce gradient variance.
  • Incorporate curriculum learning to progressively handle longer sketches and harder tasks.
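
The sketch-driven execution described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 1-D toy world, the hand-written subpolicies, the STOP constant, and the names move_until, SUBPOLICIES, SKETCHES, and run_task are all assumptions made for illustration. In the paper the subpolicies are learned networks; here they are fixed rules so that only the control flow is visible.

```python
from typing import Callable, Dict, List, Tuple

STOP = "STOP"  # hypothetical name for the special "hand over control" action

# A subpolicy maps an observation (here: a position on a line) to an action.
Subpolicy = Callable[[int], str]


def move_until(target: int) -> Subpolicy:
    """Toy stand-in for a learned subpolicy: walk to `target`, then STOP."""
    def policy(position: int) -> str:
        if position < target:
            return "right"
        if position > target:
            return "left"
        return STOP
    return policy


# One subpolicy per sketch symbol. Every task whose sketch contains the
# symbol reuses the same object, which mirrors parameter tying across tasks.
SUBPOLICIES: Dict[str, Subpolicy] = {
    "get_wood": move_until(3),
    "use_workbench": move_until(7),
    "use_toolshed": move_until(1),
}

# Illustrative sketches: each task is annotated with a sequence of symbols.
SKETCHES: Dict[str, List[str]] = {
    "make_planks": ["get_wood", "use_workbench"],
    "make_sticks": ["get_wood", "use_toolshed"],
}


def run_task(task: str, start: int = 0, max_steps: int = 100) -> Tuple[int, List[str]]:
    """Execute the subpolicies named in the task's sketch, in order."""
    sketch = SKETCHES[task]
    position, trace, idx = start, [], 0
    for _ in range(max_steps):
        if idx == len(sketch):  # every symbol in the sketch has finished
            break
        action = SUBPOLICIES[sketch[idx]](position)
        if action == STOP:      # active subpolicy yields to the next symbol
            idx += 1
            continue
        position += 1 if action == "right" else -1
        trace.append(action)
    return position, trace


if __name__ == "__main__":
    print(run_task("make_planks"))
    print(run_task("make_sticks"))
```

Both toy tasks share the get_wood subpolicy, so any improvement to it would benefit both, which is the intuition behind sharing subpolicies across tasks that use the same symbol.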

Experimental results

Research questions

  • RQ1: Can policy sketches provide sufficient guidance to enable fast, modular learning across multiple tasks without grounding details?
  • RQ2: Do shared subpolicies learned from sketches improve sample efficiency and performance compared to non-modular baselines?
  • RQ3: How do zero-shot and adaptation scenarios perform when using modular subpolicies guided by sketches?
  • RQ4: What is the impact of curriculum design and task- and state-dependent baselines on learning efficiency?

Key findings

  • Modular sketch-guided learning substantially outperforms baselines that learn task-specific or fully shared policies across crafting, maze, and cliff environments.
  • The approach induces an interpretable library of primitive policies that can be recombined to tackle new tasks.
  • Joint training with state- and task-dependent critics yields faster convergence than constant baselines.
  • Curriculum components (length-based and reward-based task sampling) improve convergence rates.
  • Zero-shot and adaptation experiments show strong generalization where baselines struggle.
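
The length-based and reward-based curriculum mentioned in the findings can be sketched roughly as follows. This is a hedged illustration, not the paper's code: it assumes rewards are normalized to [0, 1], that tasks with lower estimated reward are sampled more often (weight proportional to one minus the estimate), and that the maximum sketch length grows once every eligible task exceeds a reward threshold. The function names and the threshold parameter are hypothetical.

```python
import random
from typing import Dict


def sampling_weights(sketch_len: Dict[str, int],
                     est_reward: Dict[str, float],
                     max_len: int) -> Dict[str, float]:
    """Normalized sampling probabilities over tasks with short enough sketches."""
    # Length-based part: only tasks whose sketch fits the current limit.
    raw = {t: 1.0 - est_reward[t] for t, n in sketch_len.items() if n <= max_len}
    total = sum(raw.values())
    if total == 0.0:  # every eligible task is solved: fall back to uniform
        return {t: 1.0 / len(raw) for t in raw}
    # Reward-based part: harder tasks (low estimated reward) get more weight.
    return {t: w / total for t, w in raw.items()}


def maybe_grow(max_len: int,
               sketch_len: Dict[str, int],
               est_reward: Dict[str, float],
               threshold: float) -> int:
    """Raise the length limit once all eligible tasks beat the threshold."""
    eligible = [t for t, n in sketch_len.items() if n <= max_len]
    if eligible and all(est_reward[t] > threshold for t in eligible):
        return max_len + 1
    return max_len


def sample_task(weights: Dict[str, float], rng: random.Random) -> str:
    """Draw one task name according to the given probabilities."""
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=1)[0]


if __name__ == "__main__":
    lengths = {"a": 1, "b": 2, "c": 3}
    rewards = {"a": 0.9, "b": 0.5, "c": 0.0}
    w = sampling_weights(lengths, rewards, max_len=2)
    print(w, sample_task(w, random.Random(0)))
    print(maybe_grow(2, lengths, rewards, threshold=0.4))
```

In this toy setting, task "c" is excluded until the length limit reaches 3, and task "b" is sampled more often than "a" because its estimated reward is lower.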

This review was created by AI and reviewed by human editors.