QUICK REVIEW

[Paper Review] Unsupervised Predictive Memory in a Goal-Directed Agent

Greg Wayne, Chia-Chun Hung|arXiv (Cornell University)|Mar 28, 2018

Reinforcement Learning in Robotics | References: 14 | Cited by: 148

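One-Sentence Summary

MERLIN is an AI agent that solves highly partially observable tasks using a memory-based predictor trained by unsupervised predictive modeling, outperforming standard memory-augmented reinforcement learning agents on psychology and neuroscience benchmarks. It builds compact state variables, stores them in memory, and uses return prediction to shape both the representation and how memory is used.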

ABSTRACT

Animals execute goal-directed behaviours despite the limited range and scope of their sensors. To cope, they explore environments and store memories maintaining estimates of important information that is not presently available. Recently, progress has been made with artificial intelligence (AI) agents that learn to perform tasks from sensory input, even at a human level, by merging reinforcement learning (RL) algorithms with deep neural networks, and the excitement surrounding these results has led to the pursuit of related ideas as explanations of non-human animal learning. However, we demonstrate that contemporary RL algorithms struggle to solve simple tasks when enough information is concealed from the sensors of the agent, a property called "partial observability". An obvious requirement for handling partially observed tasks is access to extensive memory, but we show memory is not enough; it is critical that the right information be stored in the right format. We develop a model, the Memory, RL, and Inference Network (MERLIN), in which memory formation is guided by a process of predictive modeling. MERLIN facilitates the solution of tasks in 3D virtual reality environments for which partial observability is severe and memories must be maintained over long durations. Our model demonstrates a single learning agent architecture that can solve canonical behavioural tasks in psychology and neurobiology without strong simplifying assumptions about the dimensionality of sensory input or the duration of experiences.

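Motivation and Objectives

Motivate the need for memory-equipped agents that can operate in partially observable environments, where sensors fail to capture important information.
Develop MERLIN, an agent built around a memory-based predictor that compresses observations into state variables and stores them for use in prediction.
Show that unsupervised predictive modeling can guide memory formation and improve performance on tasks inspired by psychology and neuroscience.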

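Proposed Method

Introduce MERLIN, an agent architecture that combines a memory-based predictor (MBP) with a policy equipped with a read-write memory mechanism.
The MBP encodes multimodal observations into low-dimensional state variables z through a variational-autoencoder-like framework and stores them in memory.
A prior p(z_t | z_{1:t-1}, a_{1:t-1}) and a posterior q(z_t | z_{1:t-1}, a_{1:t-1}, o_t) are used to sample z_t and update memory.
The MBP is trained on a variational lower bound (VLB) consisting of multimodal reconstruction losses and a KL term between the posterior q and the prior p, with an additional return-prediction decoder that steers z_t toward reward-relevant information.
The optimization of the MBP is decoupled from that of the policy, so that representation learning is driven by predictive modeling rather than by reward alone.
Beyond the MBP, retroactive memory updates attach future information to past memories, and the authors explore how return prediction shapes the representation.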
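The training objective described above can be sketched for a single time step. This is a minimal illustration, not the paper's implementation: it assumes diagonal Gaussian prior and posterior distributions, squared-error reconstruction and return-prediction losses, and hypothetical weighting parameters `kl_weight` and `return_weight`. All function and parameter names are invented for this example.

```python
import numpy as np


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )


def sample_posterior(mu_q, logvar_q, rng):
    """Reparameterized sample z_t = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu_q.shape)
    return mu_q + np.exp(0.5 * logvar_q) * eps


def step_loss(obs, recon, ret, ret_pred,
              mu_q, logvar_q, mu_p, logvar_p,
              kl_weight=1.0, return_weight=1.0):
    """Negative lower bound for one step (up to constants), plus return loss.

    Assumed form: reconstruction error + weighted return-prediction error
    + weighted KL between the posterior q and the prior p.
    """
    recon_loss = 0.5 * np.sum((obs - recon) ** 2)
    return_loss = 0.5 * (ret - ret_pred) ** 2
    kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    return recon_loss + return_weight * return_loss + kl_weight * kl


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu_p, logvar_p = np.zeros(8), np.zeros(8)          # prior parameters
    mu_q, logvar_q = 0.1 * np.ones(8), -np.ones(8)     # posterior parameters
    z_t = sample_posterior(mu_q, logvar_q, rng)        # state variable to store
    print(z_t.shape, gaussian_kl(mu_q, logvar_q, mu_p, logvar_p))
```

In this sketch the KL term penalizes a posterior that departs from what the prior predicted from past state variables and actions alone, while the return term is what pushes z_t toward reward-relevant information.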

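Experimental Results

Research Questions

RQ1: Can unsupervised predictive memory enable a memory-based agent to solve tasks with long delays between observation and decision?
RQ2: Does compressing sensory input into state variables through predictive modeling support memory formation and retrieval better than end-to-end memory-augmented RL systems?
RQ3: Do memory reads specialize for information at different temporal distances from the goal, enabling hierarchical goal-directed behaviour?
RQ4: Can MERLIN solve psychology- and neuroscience-inspired tasks such as one-shot navigation from raw sensory data alone, without strong simplifying assumptions?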

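Key Findings

MERLIN solves memory-demanding tasks (such as a memory game and navigation in large environments) on which RL-LSTM and RL-MEM struggle or fail.
The MBP compresses high-dimensional sensory input into roughly 10^2 state variables, retaining task-relevant information through predictive modeling.
The MBP's memory reads specialize for memories formed at different distances from the goal, supporting hierarchical goal-directed strategies.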
MERLIN exhibits rapid goal localization and robust return prediction, which guide memory use and planning.
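Across a range of tasks, including arbitrary visuomotor mapping and rapid reward valuation, MERLIN outperforms end-to-end memory baselines and in some cases matches or exceeds human-level performance.
Latent learning and retroactive memory updates let MERLIN recall and use previously acquired information when needed, even beyond the range of conventional backpropagation through time.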
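The summary does not spell out how retroactive memory updates are computed. As a rough illustration of attaching future information to past memories, the sketch below assumes each memory row holds the state variable z_t concatenated with a discounted sum of the state variables that followed it. The function name `retroactive_rows`, the discount `gamma`, and the exact weighting are assumptions made for this example, not details confirmed by the text above.

```python
import numpy as np


def retroactive_rows(z, gamma=0.9):
    """Build memory rows that also summarize what happened afterwards.

    z: array of shape (T, D), one state variable per time step.
    Returns an array of shape (T, 2 * D) whose row t is
    [z_t, (1 - gamma) * sum_{k > t} gamma**(k - t) * z_k]  (assumed form).
    """
    z = np.asarray(z, dtype=float)
    T, D = z.shape
    future = np.zeros_like(z)
    acc = np.zeros(D)  # holds sum_{k > t} gamma**(k - t) * z_k for the current t
    for t in range(T - 1, -1, -1):
        future[t] = (1.0 - gamma) * acc
        acc = gamma * (z[t] + acc)  # shift the running sum one step into the past
    return np.concatenate([z, future], axis=1)


if __name__ == "__main__":
    rows = retroactive_rows(np.ones((3, 1)), gamma=0.5)
    print(rows)
```

A single backward pass keeps the cost linear in the episode length. The last row carries no future information because nothing follows it.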

This review was generated by AI and reviewed by human editors.