QUICK REVIEW

[Paper Review] On Learning Intrinsic Rewards for Policy Gradient Methods

Zeyu Zheng, Junhyuk Oh|arXiv (Cornell University)|Apr 17, 2018

Reinforcement Learning in Robotics|References 21|Cited by 33

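One-Sentence Summary

This paper proposes LIRPG, a stochastic-gradient method that learns a parameterized intrinsic reward for policy-gradient agents, with the aim of improving learning efficiency when extrinsic rewards are sparse or delayed. The intrinsic reward is trained to maximize extrinsic performance. The augmented agent outperforms extrinsic-reward-only baselines in 4 of 5 MuJoCo environments and in most of the 15 Atari games, and it also compares favorably with a "live bonus" baseline.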

ABSTRACT

In many sequential decision making tasks, it is challenging to design reward functions that help an RL agent efficiently learn behavior that is considered good by the agent designer. A number of different formulations of the reward-design problem, or close variants thereof, have been proposed in the literature. In this paper we build on the Optimal Rewards Framework of Singh et al. that defines the optimal intrinsic reward function as one that when used by an RL agent achieves behavior that optimizes the task-specifying or extrinsic reward function. Previous work in this framework has shown how good intrinsic reward functions can be learned for lookahead search based planning agents. Whether it is possible to learn intrinsic reward functions for learning agents remains an open problem. In this paper we derive a novel algorithm for learning intrinsic rewards for policy-gradient based learning agents. We compare the performance of an augmented agent that uses our algorithm to provide additive intrinsic rewards to an A2C-based policy learner (for Atari games) and a PPO-based policy learner (for MuJoCo domains) with a baseline agent that uses the same policy learners but with only extrinsic rewards. Our results show improved performance on most but not all of the domains.

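Motivation and Objectives

Address the difficulty of designing effective reward functions for sequential decision-making tasks in which the extrinsic reward is sparse or ambiguous.
Overcome the limitations of traditional reward shaping and intrinsic-motivation methods, which rely on hand-designed rewards or fixed functional forms.
Develop a scalable, end-to-end method for learning an intrinsic reward function that improves policy-gradient learning without lookahead planning or external supervision.
Enable policy-gradient agents to perform better under computational and representational constraints by learning an intrinsic reward that is optimized to maximize extrinsic return.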

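Proposed Method

Formulates intrinsic-reward learning as a bi-level optimization: the policy is trained to maximize the sum of extrinsic and intrinsic rewards, while the intrinsic-reward parameters are updated to improve extrinsic performance.
Uses stochastic gradient updates to jointly optimize the policy parameters and the intrinsic-reward parameters, treating the policy-gradient update as a differentiable function of the intrinsic-reward parameters.
Trains the intrinsic-reward module with a meta-learning objective: the intrinsic reward is updated to maximize the expected extrinsic return achieved by the policy.
Applies the method to A2C and PPO agents, sharing architecture and hyperparameters between the baseline and augmented agents to ensure a fair comparison.
Introduces delayed extrinsic rewards in the MuJoCo environments to simulate sparse feedback and make the learning task harder.
In an ablation study, trains the policy with the intrinsic reward only, to assess whether the learned intrinsic reward has enough structure to drive complex behavior.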
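
The bi-level update described in the list above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a multi-armed bandit, a tabular softmax policy with parameters `theta`, a tabular intrinsic reward `eta[a]`, and a fresh sample from the updated policy to estimate the extrinsic gradient. All function names, the step sizes `alpha` and `beta`, and the mixing weight `lam` are hypothetical. The outer step applies the chain rule: the gradient of extrinsic return with respect to `eta` is the extrinsic policy gradient at the updated parameters times the derivative of the updated parameters with respect to `eta`.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def grad_log_pi(theta, a):
    """Gradient of log softmax(theta)[a] w.r.t. theta: one_hot(a) - pi."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

def policy_update(theta, eta, a, r_ex, alpha=0.1, lam=1.0):
    """Inner step: REINFORCE update on extrinsic + lam * intrinsic reward."""
    g = grad_log_pi(theta, a)
    theta_new = theta + alpha * (r_ex + lam * eta[a]) * g
    return theta_new, g

def intrinsic_update(eta, a, g, theta_new, a_new, r_ex_new,
                     alpha=0.1, beta=0.1, lam=1.0):
    """Outer step: move eta along an estimate of d J_ex(theta_new) / d eta."""
    # Sampled extrinsic-only policy gradient at the updated policy.
    g_ex = r_ex_new * grad_log_pi(theta_new, a_new)
    # With a tabular reward, theta_new depends on eta only through eta[a]:
    # d theta_new / d eta[a] = alpha * lam * g, so the chain rule
    # collapses to a single scalar for entry a.
    meta_grad = np.zeros_like(eta)
    meta_grad[a] = alpha * lam * float(g_ex @ g)
    return eta + beta * meta_grad

def train(r_ex_table, steps=2000, seed=0, alpha=0.1, beta=0.1, lam=1.0):
    """Alternate inner and outer steps on a deterministic-reward bandit."""
    rng = np.random.default_rng(seed)
    k = len(r_ex_table)
    theta, eta = np.zeros(k), np.zeros(k)
    for _ in range(steps):
        a = rng.choice(k, p=softmax(theta))
        theta_new, g = policy_update(theta, eta, a, r_ex_table[a], alpha, lam)
        a_new = rng.choice(k, p=softmax(theta_new))
        eta = intrinsic_update(eta, a, g, theta_new, a_new,
                               r_ex_table[a_new], alpha, beta, lam)
        theta = theta_new
    return theta, eta
```

Calling `train([0.0, 0.0, 1.0])` returns the learned policy logits and intrinsic rewards for a three-armed bandit in which only the last arm pays off. A neural-network policy and reward would need automatic differentiation for the same chain rule, which this sketch sidesteps by keeping both tabular.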

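Experimental Results

Research Questions

RQ1: Does the learned intrinsic reward function significantly improve the sample efficiency and final performance of policy-gradient agents in sparse-reward environments?
RQ2: In Atari and MuJoCo environments, does learning the intrinsic reward by gradient-based optimization outperform a fixed intrinsic reward such as a "live bonus"?
RQ3: Can a policy be trained with the learned intrinsic reward alone, without any extrinsic reward signal, and still reach competitive performance?
RQ4: How does the method perform across diverse environments with different levels of sparsity and complexity?
RQ5: To what extent does the intrinsic reward function capture the underlying structure of the extrinsic reward, so that it generalizes beyond a simple exploration bonus?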

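Key Findings

With A2C on 15 Atari games, LIRPG improves on the extrinsic-reward-only baseline in most games; the abstract notes that the gains do not hold on every domain.
In the MuJoCo environments, with the extrinsic reward delayed by 20 steps, LIRPG outperforms the extrinsic-reward-only PPO baseline in 4 of 5 environments (Hopper, HalfCheetah, Walker2d, Ant).
LIRPG exceeds the "live bonus" baseline in 4 of 5 MuJoCo environments and performs comparably on HalfCheetah.
In the ablation study, a policy trained with the intrinsic reward only matches a policy trained with the sum of intrinsic and extrinsic rewards in 3 of 5 MuJoCo environments.
In Hopper, training with the intrinsic reward only falls short of the combined method but still beats training with only the "live bonus", which indicates that the learned intrinsic reward carries more than a survival signal.
These results suggest that the learned intrinsic reward encodes richer, more task-relevant structure than a simple exploration bonus, enabling effective learning even without extrinsic feedback.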
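
The 20-step reward delay and the "live bonus" baseline referred to in the findings can be read in more than one way. The sketch below shows one plausible reading and is not taken from the paper's code: it assumes the extrinsic reward is accumulated and released every `delay` steps (and at the end of the episode), and that the live bonus is a constant added for every step the agent survives. The function names and the bonus value are hypothetical.

```python
def delay_rewards(rewards, delay=20):
    """Release the accumulated extrinsic reward every `delay` steps.

    Steps in between receive 0. Any remainder is paid out on the final
    step, so the episode return is unchanged.
    """
    delayed = []
    acc = 0.0
    for t, r in enumerate(rewards, start=1):
        acc += r
        if t % delay == 0 or t == len(rewards):
            delayed.append(acc)
            acc = 0.0
        else:
            delayed.append(0.0)
    return delayed

def add_live_bonus(rewards, bonus=1.0):
    """Fixed intrinsic reward: a constant bonus for each step survived."""
    return [r + bonus for r in rewards]
```

Under this reading the total return of an episode is preserved while the per-step signal becomes sparse, which is what makes credit assignment harder for the baseline agent.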

This review was generated by AI and checked by a human editor.