QUICK REVIEW

[Paper Review] RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

Yan Duan, John Schulman|arXiv (Cornell University)|Nov 9, 2016

Reinforcement Learning in Robotics | References: 6 | Cited by: 501

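One-Sentence Summary

RL$^2$ encodes a fast reinforcement learning algorithm inside an RNN whose weights are learned by a slow outer-loop RL algorithm, enabling rapid adaptation to new MDPs and scaling to high-dimensional tasks.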

ABSTRACT

Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL$^2$ is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL$^2$ on a vision-based navigation task and show that it scales up to high-dimensional problems.

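Motivation and Objectives

Use meta-learning to exploit prior experience and reduce the sample complexity of RL.
Propose RL$^2$, an RNN-based fast RL learner whose inner learning is stored in its activations and whose outer training uses a slow RL algorithm.
Evaluate RL$^2$ on bandits, tabular MDPs, and a vision-based navigation task to assess small-scale optimality and large-scale scalability.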

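Proposed Method

Represent the policy as a GRU-based RNN that takes (s, a, r, d) as input (state, action, reward, and termination flag) and outputs an action.
Treat learning the fast RL algorithm as an RL problem over a distribution of MDPs, optimizing the cumulative discounted return over an entire trial rather than a single episode.
Train the outer loop with Trust Region Policy Optimization (TRPO), using a GRU-based baseline to stabilize learning.
Preserve the RNN hidden state across the episodes of a trial so that the hidden activations encode the dynamics of fast learning.
Handle partial observability by framing the problem as a POMDP; apply the method to a vision-based task (ViZDoom) to demonstrate scalability.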
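
The interaction pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `TinyGRUPolicy`, `run_trial`, and `arm_probs` are hypothetical names, the GRU weights are random and untrained, and the TRPO outer loop that would adjust those weights is omitted. The MDP is a Bernoulli bandit in which each episode is a single pull, so there is no state observation and the termination flag is always 1 after the first step. The point it shows is that the hidden state is reset once per trial and carried across episodes.

```python
import math
import random

def matvec(W, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyGRUPolicy:
    """Hypothetical GRU policy with random, untrained weights (illustration only)."""

    def __init__(self, n_inputs, n_hidden, n_actions, rng):
        def mat(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
        # One input and one recurrent matrix per gate: update (z), reset (r), candidate (c).
        self.W = {g: mat(n_hidden, n_inputs) for g in "zrc"}
        self.U = {g: mat(n_hidden, n_hidden) for g in "zrc"}
        self.W_out = mat(n_actions, n_hidden)
        self.n_hidden = n_hidden

    def initial_state(self):
        return [0.0] * self.n_hidden

    def step(self, x, h, rng):
        """One GRU update, then sample an action from a softmax over the outputs."""
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(a + b) for a, b in zip(matvec(self.W["c"], x), matvec(self.U["c"], rh))]
        h_new = [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]
        logits = matvec(self.W_out, h_new)
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        action = rng.choices(range(len(weights)), weights=weights)[0]
        return action, h_new

def run_trial(policy, arm_probs, n_episodes, rng):
    """Run one trial on a single bandit; every episode is one pull."""
    k = len(arm_probs)
    h = policy.initial_state()  # reset only here, at the start of the trial
    prev_action, prev_reward, prev_done = None, 0.0, 0.0
    total = 0.0
    for _ in range(n_episodes):
        # Input: one-hot previous action, previous reward, previous termination flag.
        one_hot = [1.0 if a == prev_action else 0.0 for a in range(k)]
        x = one_hot + [prev_reward, prev_done]
        action, h = policy.step(x, h, rng)  # h carries over across episodes
        reward = 1.0 if rng.random() < arm_probs[action] else 0.0
        prev_action, prev_reward, prev_done = action, reward, 1.0
        total += reward
    return total, h

rng = random.Random(0)
k = 5
policy = TinyGRUPolicy(n_inputs=k + 2, n_hidden=8, n_actions=k, rng=rng)
arm_probs = [rng.random() for _ in range(k)]  # a freshly sampled bandit (assumed uniform means)
total, h_final = run_trial(policy, arm_probs, n_episodes=10, rng=rng)
print("total reward over the trial:", total)
```

In a full system, a slow RL algorithm would sample many such bandits, run one trial on each, and update the policy weights to increase the trial-level return; only the hidden state `h` changes within a trial.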

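Experimental Results

Research Questions

RQ1: Can RL$^2$ approach the performance of theoretically optimal algorithms on structured classes of MDPs, such as bandits and tabular MDPs?
RQ2: Can RL$^2$ scale to high-dimensional tasks such as vision-based navigation?
RQ3: How does RL$^2$ compare with established Bayesian and exploration-exploitation methods across different horizons and state-action spaces?
RQ4: What are the bottlenecks in outer-loop optimization, and can architectural choices mitigate them?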

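Key Findings

Multi-armed bandit results (average total reward; n = number of episodes, k = number of arms):

Setup	Random	Gittins	TS	OTS	UCB1	ϵ-Greedy	Greedy	RL$^2$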
n = 10, k = 5	5.0	6.6	5.7	6.5	6.7	6.6	6.6	6.7
n = 10, k = 10	5.0	6.6	5.5	6.2	6.7	6.6	6.6	6.7
n = 10, k = 50	5.1	6.5	5.2	5.5	6.6	6.5	6.5	6.8
n = 100, k = 5	49.9	78.3	74.7	77.9	78.0	75.4	74.8	78.7
n = 100, k = 10	49.9	82.8	76.7	81.4	82.4	77.4	77.1	83.5
n = 100, k = 50	49.8	85.2	64.5	67.7	84.3	78.3	78.0	84.9
n = 500, k = 5	249.8	405.8	402.0	406.7	405.8	388.2	380.6	401.6
n = 500, k = 10	249.0	437.8	429.5	438.9	437.1	408.0	395.0	432.5
n = 500, k = 50	249.6	463.7	427.2	437.6	457.6	413.6	402.8	438.9
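
For readers who want a feel for what the baseline columns measure, the following sketch estimates average total reward for three of the simpler strategies on Bernoulli bandits. It is an illustration under assumptions, not a reproduction of the table: the function names are hypothetical, arm means are assumed to be drawn uniformly from [0, 1], UCB1 is the textbook version without any tuned constant, and TS uses a Beta(1, 1) prior. The numbers it prints should not be expected to match the table exactly.

```python
import math
import random

def random_arm(counts, sums, rng):
    """Baseline: choose an arm uniformly at random."""
    return rng.randrange(len(counts))

def ucb1(counts, sums, rng):
    """Textbook UCB1: pull each arm once, then maximize mean + sqrt(2 ln N / n_i)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    n_total = sum(counts)
    return max(
        range(len(counts)),
        key=lambda i: sums[i] / counts[i] + math.sqrt(2.0 * math.log(n_total) / counts[i]),
    )

def thompson(counts, sums, rng):
    """Thompson sampling with a Beta(1, 1) prior on each Bernoulli arm."""
    draws = [rng.betavariate(1.0 + s, 1.0 + (n - s)) for n, s in zip(counts, sums)]
    return max(range(len(draws)), key=draws.__getitem__)

def total_reward(select, arm_probs, n_pulls, rng):
    """Play one bandit instance for n_pulls steps and return the summed reward."""
    counts = [0] * len(arm_probs)
    sums = [0.0] * len(arm_probs)
    total = 0.0
    for _ in range(n_pulls):
        arm = select(counts, sums, rng)
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

def average_total_reward(select, k, n_pulls, n_instances, seed=0):
    """Average total reward over freshly sampled bandit instances."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_instances):
        arm_probs = [rng.random() for _ in range(k)]  # assumed: uniform arm means
        totals.append(total_reward(select, arm_probs, n_pulls, rng))
    return sum(totals) / n_instances

for name, select in [("Random", random_arm), ("UCB1", ucb1), ("TS", thompson)]:
    print(name, round(average_total_reward(select, k=5, n_pulls=100, n_instances=200), 1))
```

Each strategy sees only pull counts and reward sums, which is the same information an RL$^2$ policy must summarize in its hidden state during a trial.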

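Across several multi-armed bandit and tabular MDP settings, RL$^2$ performs close to algorithms with theoretical optimality guarantees.
In large-scale vision-based navigation, RL$^2$ demonstrates the ability to use visual information and short-term memory accumulated across episodes.
On tabular MDPs with short horizons, RL$^2$ can outperform several baselines; the advantage shrinks as the number of episodes increases.
In the visual navigation task, trajectory length drops markedly from the first episode to the second, indicating effective use of past experience.
Learning curves reveal variability across random initializations, highlighting sensitivity to outer-loop optimization and architecture.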

This review was generated by AI and checked by a human editor.