QUICK REVIEW

[Paper Review] Bootstrap Latent-Predictive Representations for Multitask Reinforcement Learning

Daniel Guo, Bernardo Ávila Pires|arXiv (Cornell University)|Apr 30, 2020

Reinforcement Learning in Robotics|References: 38|Cited by: 42

One-Sentence Summary

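The paper proposes Predictions of Bootstrapped Latents (PBL), a self-supervised representation learning method for multitask deep reinforcement learning that predicts future latent embeddings and closes a bootstrapping loop between latent-to-state and state-to-latent predictions, improving performance on DMLab-30 and Atari-57.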

ABSTRACT

Learning a good representation is an essential component for deep reinforcement learning (RL). Representation learning is especially important in multitask and partially observable settings where building a representation of the unknown environment is crucial to solve the tasks. Here we introduce Prediction of Bootstrap Latents (PBL), a simple and flexible self-supervised representation learning algorithm for multitask deep RL. PBL builds on multistep predictive representations of future observations, and focuses on capturing structured information about environment dynamics. Specifically, PBL trains its representation by predicting latent embeddings of future observations. These latent embeddings are themselves trained to be predictive of the aforementioned representations. These predictions form a bootstrapping effect, allowing the agent to learn more about the key aspects of the environment dynamics. In addition, by defining prediction tasks completely in latent space, PBL provides the flexibility of using multimodal observations involving pixel images, language instructions, rewards and more. We show in our experiments that PBL delivers across-the-board improved performance over state of the art deep RL agents in the DMLab-30 and Atari-57 multitask setting.

Motivation and Objectives

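Improve representation learning in multitask, partially observable reinforcement learning settings.
Develop a self-supervised auxiliary task focused on predicting latent embeddings of future observations.
Introduce a bootstrapping mechanism between latent observations and agent states to enrich the representation.
Enable integration of multimodal observations by operating entirely in latent space.
Evaluate PBL empirically on DMLab-30 and Atari-57 against state-of-the-art baselines.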

Proposed Method

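Define Z_t = f(O_t) as the latent embedding of observation O_t, produced by a learned encoder f.
Forward prediction: a predictor g maps the compressed partial history B_{t,k} to a prediction of Z_{t+k}; minimize ||g(B_{t,k})-Z_{t+k}||^2 over horizons k=1..K.
Reverse prediction: a predictor g' maps the latent Z_t to a prediction of the compressed history B_t; minimize ||g'(f(O_t))-B_t||^2.
Train the forward and reverse predictors jointly, forming a bootstrapping loop that avoids trivial solutions.
Use two RNNs to compute B_t and B_{t,k}: h_f summarizes the full history and h_p the partial history.
Use PopArt-IMPALA as the RL baseline, with a larger architecture to improve performance and subsampled time steps to improve efficiency.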
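
The two prediction losses listed above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the random linear maps standing in for f, g and g', the tanh recurrences standing in for h_f and h_p, the choice to unroll h_p from B_t using actions only, and all dimensions are assumptions made for illustration. The sketch only evaluates the losses; how gradients are routed between the two objectives is not specified in this summary and is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these values come from the paper.
OBS_DIM, ACT_DIM, LAT_DIM, HID_DIM = 8, 4, 5, 6

def linear(in_dim, out_dim):
    """Random linear map standing in for a learned network (assumption)."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

W_f = linear(OBS_DIM, LAT_DIM)                       # encoder f: O_t -> Z_t
W_g = linear(HID_DIM, LAT_DIM)                       # forward predictor g
W_g_rev = linear(LAT_DIM, HID_DIM)                   # reverse predictor g'
W_hf = linear(HID_DIM + OBS_DIM + ACT_DIM, HID_DIM)  # RNN h_f (full history)
W_hp = linear(HID_DIM + ACT_DIM, HID_DIM)            # RNN h_p (partial history)

def pbl_losses(obs, actions, max_horizon):
    """Return (forward_loss, reverse_loss) averaged over one trajectory."""
    T = obs.shape[0]
    Z = obs @ W_f  # latent embeddings Z_t = f(O_t), shape (T, LAT_DIM)

    # Full-history states B_t: h_f consumes O_t and the previous action.
    B = np.zeros((T, HID_DIM))
    b = np.zeros(HID_DIM)
    for t in range(T):
        prev_action = actions[t - 1] if t > 0 else np.zeros(ACT_DIM)
        b = np.tanh(np.concatenate([b, obs[t], prev_action]) @ W_hf)
        B[t] = b

    # Forward loss: unroll h_p from B_t (assumed: actions only), predict Z_{t+k}.
    forward_terms = []
    for t in range(T):
        b_partial = B[t]
        for k in range(1, max_horizon + 1):
            if t + k >= T:
                break
            step_input = np.concatenate([b_partial, actions[t + k - 1]])
            b_partial = np.tanh(step_input @ W_hp)  # B_{t,k}
            forward_terms.append(np.sum((b_partial @ W_g - Z[t + k]) ** 2))
    forward_loss = float(np.mean(forward_terms))

    # Reverse loss: predict B_t from Z_t = f(O_t).
    reverse_loss = float(np.mean(np.sum((Z @ W_g_rev - B) ** 2, axis=1)))
    return forward_loss, reverse_loss

# Usage on a random trajectory (synthetic data, for shape-checking only).
T = 12
obs = rng.normal(size=(T, OBS_DIM))
actions = np.eye(ACT_DIM)[rng.integers(0, ACT_DIM, size=T)]  # one-hot actions
fwd_loss, rev_loss = pbl_losses(obs, actions, max_horizon=3)
print(f"forward loss: {fwd_loss:.4f}, reverse loss: {rev_loss:.4f}")
```

In a trainable version, both losses would be added as auxiliary terms to the RL objective and minimized jointly, which is what couples the encoder f to the history states B_t.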

Experimental Results

Research Questions

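RQ1: Does PBL outperform existing representation learning methods in multitask RL on DMLab-30 and Atari-57?
RQ2: How does the prediction horizon affect PBL's performance, and what role does reverse prediction play in learning meaningful latent representations?
RQ3: Does PBL reliably avoid collapsing to trivial representations, and how do architectural choices affect the results?
RQ4: Can PBL encodings capture structure shared across tasks and generalize to unseen tasks?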

Key Findings

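In the DMLab-30 multitask setting, PBL outperforms auxiliary representation tasks such as pixel control, CPC, and DRAW.
Performance improves as the forward horizon increases, though with diminishing returns; multistep prediction is more beneficial than single-step prediction.
Without reverse prediction (random latent targets), the agent still benefits from longer horizons, but forward prediction alone is insufficient when the latent targets are not meaningful; reverse prediction helps learn useful latent structure.
PBL remains stable and does not collapse to trivial solutions; using random projections in the latent path still yields competitive results, indicating robust training dynamics.
On Atari-57, PBL raises the median human-normalized score across tasks, suggesting that its benefits extend beyond DMLab-30; in the multitask setting, PBL outperforms the baseline on several tasks.
Decoding probes show that PBL's representations encode object position information and retain it for longer than the random-projection baseline.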
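
For readers unfamiliar with the Atari-57 metric mentioned above, the sketch below computes a median human-normalized score. It assumes the commonly used definition, 100 * (agent - random) / (human - random) per game; this summary does not state the exact formula used in the paper. The three games and all scores are made up for illustration and are not results from the paper.

```python
import numpy as np

def human_normalized(agent, random, human):
    """Per-game score as a percentage of the human-minus-random gap (assumed definition)."""
    agent, random, human = (np.asarray(x, dtype=float) for x in (agent, random, human))
    return 100.0 * (agent - random) / (human - random)

# Made-up raw scores for three hypothetical games.
agent_scores = [150.0, 40.0, 900.0]
random_scores = [50.0, 10.0, 100.0]
human_scores = [250.0, 30.0, 500.0]

scores = human_normalized(agent_scores, random_scores, human_scores)
median_score = float(np.median(scores))
print(scores, median_score)
```

The median is typically preferred over the mean for this kind of aggregate because a few games with very large normalized scores would otherwise dominate it.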

This review was generated by AI and reviewed by a human editor.