QUICK REVIEW

[Paper Review] Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables

Kate Rakelly, Aurick Zhou|arXiv (Cornell University)|Mar 19, 2019

Reinforcement Learning in Robotics | Cited by 227

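One-Sentence Summary

PEARL is an off-policy meta-reinforcement learning algorithm that uses a probabilistic latent context to adapt quickly to new tasks. Across six continuous-control benchmarks it improves meta-training sample efficiency by 20-100X and also improves asymptotic performance.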

ABSTRACT

Deep reinforcement learning algorithms require large amounts of experience to learn an individual task. While in principle meta-reinforcement learning (meta-RL) algorithms enable agents to learn new skills from small amounts of experience, several major challenges preclude their practicality. Current methods rely heavily on on-policy experience, limiting their sample efficiency. They also lack mechanisms to reason about task uncertainty when adapting to new tasks, limiting their effectiveness in sparse reward problems. In this paper, we address these challenges by developing an off-policy meta-RL algorithm that disentangles task inference and control. In our approach, we perform online probabilistic filtering of latent task variables to infer how to solve a new task from small amounts of experience. This probabilistic interpretation enables posterior sampling for structured and efficient exploration. We demonstrate how to integrate these task variables with off-policy RL algorithms to achieve both meta-training and adaptation efficiency. Our method outperforms prior algorithms in sample efficiency by 20-100X as well as in asymptotic performance on several meta-RL benchmarks.

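Research Motivation and Goals

Reduce the sample inefficiency of meta-reinforcement learning by moving to off-policy training.
Reason about task uncertainty online through a probabilistic latent context, enabling structured exploration.
Decouple task inference from control so that off-policy RL can be used for efficient meta-training.
Achieve fast, trajectory-level adaptation at test time through posterior sampling over the task context.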

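Proposed Method

Introduce a probabilistic latent context variable z on which the policy is conditioned: π(a|s,z).
Use an amortized variational encoder qφ(z|c) to approximate the posterior p(z|c) from recent experience c.
Use a permutation-invariant encoder that models the posterior over the context as a product of Gaussian factors, one per transition.
At test time, sample z from qφ(z|c) and hold it fixed for an episode to enable structured exploration.
Train the encoder separately on off-policy data, and update the actor and critic with SAC-style objectives.
Place the method in an off-policy meta-RL framework by decoupling context sampling from RL data collection (Algorithm 1).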
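
The product-of-Gaussian-factors posterior and the per-episode sampling of z described above can be illustrated with a minimal NumPy sketch. It assumes diagonal Gaussian factors, so the product reduces to summing precisions and taking a precision-weighted mean. The function names `product_of_gaussians` and `sample_z` and the toy factor values are hypothetical and are not taken from the paper's code; in the real method the per-transition means and variances would come from the learned encoder network.

```python
import numpy as np

def product_of_gaussians(mus, sigmas_sq):
    """Combine N diagonal Gaussian factors into a single Gaussian.

    mus, sigmas_sq: arrays of shape (N, latent_dim), one row per transition.
    Returns the mean and variance of the normalized product, each of
    shape (latent_dim,).
    """
    mus = np.asarray(mus, dtype=float)
    # Clip variances away from zero so the precisions stay finite.
    sigmas_sq = np.clip(np.asarray(sigmas_sq, dtype=float), 1e-7, None)
    precision = 1.0 / sigmas_sq
    # Precisions add; the mean is the precision-weighted average.
    sigma_sq = 1.0 / precision.sum(axis=0)
    mu = sigma_sq * (mus * precision).sum(axis=0)
    return mu, sigma_sq

def sample_z(mu, sigma_sq, rng):
    """Draw one latent context z, to be held fixed for a whole episode."""
    return mu + np.sqrt(sigma_sq) * rng.standard_normal(mu.shape)

# Toy stand-in for encoder outputs on three transitions (latent_dim = 2).
mus = np.array([[0.0, 1.0], [2.0, 1.0], [1.0, 1.0]])
sigmas_sq = np.array([[1.0, 0.5], [1.0, 0.5], [1.0, 0.5]])

mu, sigma_sq = product_of_gaussians(mus, sigmas_sq)
z = sample_z(mu, sigma_sq, np.random.default_rng(0))
```

Because the combination only uses sums over the transition axis, reordering the transitions leaves the result unchanged, which is the permutation invariance the method relies on. Each additional factor adds precision, so the posterior variance shrinks as more context is collected; in this toy case the three factors give a mean of [1, 1] and variances of [1/3, 1/6].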

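Experimental Results

Research Questions

RQ1: How can off-policy meta-training for RL be made efficient while preserving fast adaptation to new tasks?
RQ2: In sparse-reward settings with unseen tasks, can a probabilistic latent context enable effective, temporally extended exploration?
RQ3: To what extent does decoupling task inference from control improve sample efficiency and final performance in meta-RL?
RQ4: For exploration in meta-RL, how does posterior sampling over the task context compare with prior methods?
RQ5: Which data-sampling strategies are key for training the encoder and the policy in off-policy meta-RL?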

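Key Findings

PEARL improves meta-training sample efficiency by 20-100X over prior meta-RL methods.
PEARL also substantially improves asymptotic performance on six continuous-control meta-learning benchmarks.
Posterior sampling over the latent task context yields temporally extended exploration, which helps fast adaptation in sparse-reward tasks.
Decoupling context inference from the actor-critic makes off-policy meta-training effective, with minimal distribution mismatch between meta-training and meta-testing.
Under sparse rewards, the probabilistic latent context is essential for exploration: on a sparse navigation task it outperforms both a deterministic-context variant and prior methods.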

This review was generated by AI and checked by human editors.