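[Paper Summary] RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning
RL$^2$ encodes a fast reinforcement learning algorithm in an RNN whose weights are learned by a slow outer-loop RL algorithm, enabling fast adaptation to new MDPs and scaling to high-dimensional tasks.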
Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL$^2$ is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL$^2$ on a vision-based navigation task and show that it scales up to high-dimensional problems.
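Motivation and Objectives
- Motivation: reduce the sample complexity of RL by using meta-learning to exploit prior experience.
- Propose RL$^2$, an RNN-based fast RL learner whose inner-loop learning is stored in its activations, while its weights are trained in the outer loop by a slow RL algorithm.
- Evaluate RL$^2$ on bandits, tabular MDPs, and a vision-based navigation task to assess near-optimality at small scale and scalability at large scale.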
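Proposed Method
- Represent the policy as a GRU-based RNN that takes (s, a, r, d) as input (observation, previous action, reward, termination flag) and outputs an action.
- Cast learning the fast RL algorithm as an RL problem over a distribution of MDPs, with an objective based on the discounted return accumulated over a whole trial rather than a single episode.
- Train the outer loop with Trust Region Policy Optimization (TRPO), using a GRU-based baseline to stabilize learning.
- Preserve the RNN state across the episodes of a trial so that the hidden activations encode the fast-learning dynamics.
- Handle partial observability by framing the problem as a POMDP; apply the method to a vision-based task (ViZDoom) to demonstrate scalability.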
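The trial structure described in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `BernoulliBandit`, `GRUPolicy`, and `run_trial` names are hypothetical, the GRU weights are random placeholders (no TRPO outer loop is shown), and the bandit has no observation, so the input is only the previous action, reward, and termination flag. The point it shows is that the hidden state is reset once per trial and carried across episode boundaries.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BernoulliBandit:
    """Hypothetical toy MDP: every arm pull is an episode of length 1."""
    def __init__(self, k, rng):
        self.k = k
        self.p = rng.uniform(size=k)  # assumed arm success probabilities
        self.rng = rng

    def step(self, action):
        reward = float(self.rng.random() < self.p[action])
        return reward, True  # (reward, done)

class GRUPolicy:
    """Untrained GRU policy; in RL^2 these weights would be learned by the slow outer loop."""
    def __init__(self, n_in, n_hidden, n_actions, rng):
        shape = (n_hidden, n_in + n_hidden)
        self.Wz = rng.normal(scale=0.1, size=shape)  # update gate
        self.Wr = rng.normal(scale=0.1, size=shape)  # reset gate
        self.Wh = rng.normal(scale=0.1, size=shape)  # candidate state
        self.Wo = rng.normal(scale=0.1, size=(n_actions, n_hidden))
        self.n_hidden = n_hidden

    def initial_state(self):
        return np.zeros(self.n_hidden)

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)
        r = sigmoid(self.Wr @ xh)
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        h_new = (1.0 - z) * h + z * h_tilde
        logits = self.Wo @ h_new
        probs = np.exp(logits - logits.max())
        return probs / probs.sum(), h_new

def run_trial(policy, env, n_episodes, rng):
    """Run one trial: several episodes in the same MDP with a shared hidden state."""
    h = policy.initial_state()        # reset only at the start of the trial
    x = np.zeros(env.k + 2)           # no previous action or reward yet
    total = 0.0
    for _ in range(n_episodes):
        done = False
        while not done:
            probs, h = policy.step(x, h)   # h is NOT reset between episodes
            a = int(rng.choice(env.k, p=probs))
            reward, done = env.step(a)
            total += reward
            # Next input: one-hot previous action, reward, termination flag.
            x = np.concatenate([np.eye(env.k)[a], [reward, float(done)]])
    return total, h
```

Because the weights here are untrained, the policy acts close to uniformly at random; the sketch only shows where the "fast" learner's state lives, namely in `h`.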
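Experimental Results
Research Questions
- RQ1: Can RL$^2$ approach the performance of theoretically optimal algorithms on structured MDP classes such as bandits and tabular MDPs?
- RQ2: Can RL$^2$ scale to high-dimensional tasks such as vision-based navigation?
- RQ3: How does RL$^2$ compare with established Bayesian and exploration-exploitation methods across different horizons and state-action space sizes?
- RQ4: What are the bottlenecks in outer-loop optimization, and can architectural choices mitigate them?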
Key Findings
Multi-armed bandit results (total reward over n episodes with k arms):
| Setup | Random | Gittins | TS | OTS | UCB1 | ϵ-Greedy | Greedy | RL$^2$ |
|---|---|---|---|---|---|---|---|---|
| n = 10, k = 5 | 5.0 | 6.6 | 5.7 | 6.5 | 6.7 | 6.6 | 6.6 | 6.7 |
| n = 10, k = 10 | 5.0 | 6.6 | 5.5 | 6.2 | 6.7 | 6.6 | 6.6 | 6.7 |
| n = 10, k = 50 | 5.1 | 6.5 | 5.2 | 5.5 | 6.6 | 6.5 | 6.5 | 6.8 |
| n = 100, k = 5 | 49.9 | 78.3 | 74.7 | 77.9 | 78.0 | 75.4 | 74.8 | 78.7 |
| n = 100, k = 10 | 49.9 | 82.8 | 76.7 | 81.4 | 82.4 | 77.4 | 77.1 | 83.5 |
| n = 100, k = 50 | 49.8 | 85.2 | 64.5 | 67.7 | 84.3 | 78.3 | 78.0 | 84.9 |
| n = 500, k = 5 | 249.8 | 405.8 | 402.0 | 406.7 | 405.8 | 388.2 | 380.6 | 401.6 |
| n = 500, k = 10 | 249.0 | 437.8 | 429.5 | 438.9 | 437.1 | 408.0 | 395.0 | 432.5 |
| n = 500, k = 50 | 249.6 | 463.7 | 427.2 | 437.6 | 457.6 | 413.6 | 402.8 | 438.9 |
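- Across several multi-armed bandit and tabular MDP settings, RL$^2$ performs close to algorithms with theoretical optimality guarantees.
- In large-scale vision-based navigation, RL$^2$ learns to use visual information together with short-term memory accumulated over several episodes.
- On tabular MDPs with few episodes, RL$^2$ can outperform several baselines; the advantage shrinks as the number of episodes grows.
- In the visual navigation task, trajectory length drops markedly from the first episode to the second, indicating effective use of past experience.
- Learning curves vary across random initializations, highlighting sensitivity to the outer-loop optimization and the architecture.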
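To make the bandit comparison more concrete, here is a minimal sketch of two of the hand-designed baselines named in the table, Random and UCB1, on Bernoulli bandits. It assumes arm probabilities drawn uniformly from [0, 1] and uses the textbook UCB1 bonus; the function names are hypothetical, and the baselines in the paper may be tuned differently, so the numbers this produces are not expected to match the table.

```python
import math
import numpy as np

def pull(p, a, rng):
    """Sample a Bernoulli reward from arm a."""
    return float(rng.random() < p[a])

def run_random(p, n, rng):
    """Pull a uniformly random arm n times and return the total reward."""
    return sum(pull(p, int(rng.integers(len(p))), rng) for _ in range(n))

def run_ucb1(p, n, rng):
    """Textbook UCB1: empirical mean plus sqrt(2 ln t / pulls of that arm)."""
    k = len(p)
    counts = np.zeros(k)
    sums = np.zeros(k)
    total = 0.0
    for t in range(n):
        if t < k:
            a = t  # pull every arm once before using the bonus
        else:
            ucb = sums / counts + np.sqrt(2.0 * math.log(t) / counts)
            a = int(np.argmax(ucb))
        reward = pull(p, a, rng)
        counts[a] += 1
        sums[a] += reward
        total += reward
    return total

def average_total_reward(run, n, k, instances, seed=0):
    """Average total reward over freshly sampled bandit instances."""
    rng = np.random.default_rng(seed)
    return float(np.mean([run(rng.uniform(size=k), n, rng)
                          for _ in range(instances)]))
```

Calling `average_total_reward(run_ucb1, n=100, k=5, instances=200)` and the same with `run_random` gives a quick feel for the gap between an exploring algorithm and the random baseline in the n = 100, k = 5 row.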
This summary was generated by AI and reviewed by a human editor.