QUICK REVIEW

[Paper Review] Learning to Understand Goal Specifications by Modelling Reward

Dzmitry Bahdanau, Felix Hill|arXiv (Cornell University)|Jun 5, 2018

Reinforcement Learning in Robotics|Cited by 69

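One-Sentence Summary

AGILE trains instruction-conditional reinforcement learning agents using rewards from a learned reward model that is trained jointly on expert goal states and the agent's own experience, so the agent can learn to understand instructions without hard-coded environment rewards and can generalize to new environments.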

ABSTRACT

Recent work has shown that deep reinforcement-learning agents can learn to follow language-like instructions from infrequent environment rewards. However, this places on environment designers the onus of designing language-conditional reward functions which may not be easily or tractably implemented as the complexity of the environment and the language scales. To overcome this limitation, we present a framework within which instruction-conditional RL agents are trained using rewards obtained not from the environment, but from reward models which are jointly trained from expert examples. As reward models improve, they learn to accurately reward agents for completing tasks for environment configurations---and for instructions---not present amongst the expert data. This framework effectively separates the representation of what instructions require from how they can be executed. In a simple grid world, it enables an agent to learn a range of commands requiring interaction with blocks and understanding of spatial relations and underspecified abstract arrangements. We further show the method allows our agent to adapt to changes in the environment without requiring new expert examples.

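Motivation and Goals

Reduce reliance on hand-designed language-conditional rewards by learning the reward from examples of goal states.
Propose a framework that jointly learns an instruction-conditional reward model and a policy.
Adapt to new environments without requiring new expert examples.
Show that, in simple grid-world tasks, the learned reward can guide the agent as effectively as the ground-truth environment reward.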

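Proposed Method

Introduce Adversarial Goal-Induced Learning from Examples (AGILE), in which a discriminator $D_\phi$ learns to predict whether a state $s$ is a goal state for an instruction $c$.
Train the policy $\pi_\theta$ to maximize the expected discounted return under the modelled reward $\hat{r}_t = [D_\phi(c, s_t) > 0.5]$, that is, a reward of 1 when the discriminator score exceeds 0.5 and 0 otherwise.
Update the reward model by discriminating expert $(c, s)$ goal-state examples from the dataset $\mathcal{D}$ against agent-generated $(c, s)$ pairs from the replay buffer $\mathcal{B}$, using a cross-entropy objective $L_D(\phi)$.
When updating $D_\phi$, handle false negatives (agent states that actually satisfy the instruction) with a sampling heuristic: rank the examples from $\mathcal{B}$ by discriminator score and discard the top $1-\rho$ share, where $\rho$ is the anticipated negative rate.
Compare policies trained with AGILE (AGILE-A3C) against policies trained with the ground-truth environment reward and against a baseline with an auxiliary reward prediction objective (RP).
Explore two model architectures (FiLM-NMN and FiLM-LSTM) for encoding the instruction and combining it with the visual state representation.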
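
The reward thresholding and the reward-model update described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`modelled_reward`, `filter_negatives`, `discriminator_loss`), the `eps` constant, and the use of plain lists of discriminator scores in place of a neural network are all hypothetical. It also assumes that ρ is given as a fraction in [0, 1] and that the discarded agent examples are the ones with the highest discriminator scores, since those are the most likely to be unlabelled goal states.

```python
import math


def modelled_reward(d_score, threshold=0.5):
    """Binary modelled reward: 1.0 if the discriminator score exceeds the threshold."""
    return 1.0 if d_score > threshold else 0.0


def filter_negatives(agent_scores, rho):
    """Keep the rho fraction of agent examples with the lowest discriminator scores.

    The remaining top (1 - rho) fraction is dropped as potential false
    negatives (assumption: rho is a fraction in [0, 1]).
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    n_keep = int(round(rho * len(agent_scores)))
    return sorted(agent_scores)[:n_keep]


def discriminator_loss(expert_scores, agent_scores, rho, eps=1e-8):
    """Cross-entropy loss: expert goal states are positives, kept agent states negatives."""
    negatives = filter_negatives(agent_scores, rho)
    pos = [-math.log(p + eps) for p in expert_scores]
    neg = [-math.log(1.0 - p + eps) for p in negatives]
    pos_term = sum(pos) / len(pos) if pos else 0.0
    neg_term = sum(neg) / len(neg) if neg else 0.0
    return pos_term + neg_term
```

For example, `filter_negatives([0.9, 0.1, 0.6, 0.3], rho=0.75)` keeps the three lowest-scoring agent examples and drops the one scored 0.9, so a state the agent may have solved correctly is not pushed towards a low score. In a real setup the scores would come from the discriminator network and the loss would be minimized with a gradient-based optimizer.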

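Experimental Results

Research Questions

RQ1: Can a learned, instruction-conditional reward model effectively supervise a reinforcement learning policy in the absence of environment rewards?
RQ2: Does AGILE learn faster and reach performance comparable to the environment-reward baseline across different instruction types?
RQ3: How well does the reward model generalize to unseen instructions and to changes in the environment?
RQ4: Can the reward model be reused to train or fine-tune policies in new configurations?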

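Main Findings

AGILE-A3C learns the GridLU-Relations task more easily than standard A3C trained with the environment reward.
Adding an auxiliary reward prediction objective further improves A3C, bringing it close to AGILE's performance.
The reward model reaches very high accuracy (about 99% or more), and its early false positives provide a useful learning signal.
AGILE with the structure-agnostic FiLM-LSTM reaches a high success rate, indicating that language grounding does not strictly require the NMN structure.
When the environment dynamics change, the reward model generalizes well enough to keep guiding the policy, and fine-tuning helps the policy recover its performance.
GridLU-Arrangements shows that AGILE scales to a larger goal-state space, achieving meaningful success with limited expert goal data (100,000 examples) and with final states judged by human evaluators.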


This review was generated by AI and reviewed by a human editor.