QUICK REVIEW

[Paper Review] Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping

Yujing Hu, Weixun Wang | arXiv (Cornell University) | Nov 5, 2020

Reinforcement Learning in Robotics | References: 25 | Cited by: 94

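One-Sentence Summary

BiPaRS is a bi-level optimization framework that learns a shaping weight function to adaptively utilize a given shaping reward function, with three gradient-based algorithms (EM, MGL, IMGL); experiments on CartPole and MuJoCo show that it amplifies beneficial shaping rewards while attenuating harmful ones.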

ABSTRACT

Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL). Existing approaches such as potential-based reward shaping normally make full use of a given shaping reward function. However, since the transformation of human knowledge into numeric reward values is often imperfect due to reasons such as human cognitive bias, completely utilizing the shaping reward function may fail to improve the performance of RL algorithms. In this paper, we consider the problem of adaptively utilizing a given shaping reward function. We formulate the utilization of shaping rewards as a bi-level optimization problem, where the lower level is to optimize policy using the shaping rewards and the upper level is to optimize a parameterized shaping weight function for true reward maximization. We formally derive the gradient of the expected true reward with respect to the shaping weight function parameters and accordingly propose three learning algorithms based on different assumptions. Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards, and meanwhile ignore unbeneficial shaping rewards or even transform them into beneficial ones.

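Motivation and Objectives

Treat reward shaping as a means of injecting domain knowledge into reinforcement learning (RL).
Formalize the adaptive utilization of a given shaping reward as a bi-level optimization problem.
Develop gradient-based methods that optimize the shaping weights to maximize the true reward.
Demonstrate that the approach can identify beneficial shaping signals and suppress or transform harmful ones.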
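The bi-level problem named in these objectives can be written out as follows. This is a sketch rather than the paper's exact statement: the symbols r (true reward), f (given shaping reward), z_phi (shaping weight function), θ (policy parameters), J and tilde{J} follow this review, while the discounted-return form and the discount factor γ are assumptions made here for concreteness.

```latex
\begin{aligned}
\max_{\phi}\quad
  & J(z_\phi) = \mathbb{E}_{\pi_{\theta}}\Big[\sum_{t} \gamma^{t}\, r(s_t, a_t)\Big] \\
\text{s.t.}\quad
  & \theta = \arg\max_{\theta'} \tilde{J}(\pi_{\theta'}, z_\phi), \\
  & \tilde{J}(\pi_{\theta'}, z_\phi)
    = \mathbb{E}_{\pi_{\theta'}}\Big[\sum_{t} \gamma^{t}
      \big(r(s_t, a_t) + z_\phi(s_t, a_t)\, f(s_t, a_t)\big)\Big]
\end{aligned}
```

The upper level only ever sees the true reward; the shaping reward influences it indirectly, through the policy that the lower level produces.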

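Proposed Method

Models the modified reward as r'(s,a) = r(s,a) + z_phi(s,a) f(s,a), where r is the true reward, f is the given shaping reward, and z_phi is the learned shaping weight function.
Defines a bi-level objective: the upper level maximizes the expected true reward J(z_phi) over phi, while the lower level optimizes the policy, parameterized by θ, to maximize the expected modified reward tilde{J}.
Derives the gradient of J(z_phi) with respect to phi and proposes three algorithms that approximate it: Explicit Mapping (EM), Meta-Gradient Learning (MGL), and Incremental Meta-Gradient Learning (IMGL).
Gives the gradient expressions: (4) for …, and (5) for …, plus (6)-(9) detailing the update rules.
Discusses the explicit mapping of z_phi into an extended state space S_z and the corresponding hyper-policy formulation.
Provides complexity considerations and algorithm steps in the supplementary material.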
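To make the interaction between the two levels concrete, here is a minimal toy sketch on a two-armed bandit. It is not the paper's EM, MGL, or IMGL algorithm: it uses exact expected gradients for a softmax policy and differentiates the true reward through a single lower-level update by hand, which is only loosely in the spirit of a meta-gradient. The function names (`train_bandit`, `softmax_value_grad`), the reward values, and the learning rates `alpha` and `beta` are all assumptions chosen for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_value_grad(pi, rewards):
    """Exact gradient of sum_a pi[a] * rewards[a] w.r.t. the softmax logits."""
    baseline = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - baseline) for p, r in zip(pi, rewards)]


def train_bandit(r, f, learn_weights, steps=1000, alpha=0.5, beta=2.0):
    """Hypothetical toy: r = true rewards, f = shaping rewards (one per arm)."""
    n = len(r)
    theta = [0.0] * n  # policy logits (lower level)
    z = [1.0] * n      # per-arm shaping weights (upper level); 1.0 = naive shaping
    for _ in range(steps):
        pi = softmax(theta)
        # Lower level: one gradient-ascent step on the modified reward r + z * f.
        shaped = [r[a] + z[a] * f[a] for a in range(n)]
        g = softmax_value_grad(pi, shaped)
        theta_new = [theta[a] + alpha * g[a] for a in range(n)]
        if learn_weights:
            # Upper level: chain rule through the lower-level step.
            # J is the expected TRUE reward under the updated policy, and
            # d theta_new[i] / d z[j] = alpha * pi[i] * f[j] * ((i == j) - pi[j]).
            d_j = softmax_value_grad(softmax(theta_new), r)
            for j in range(n):
                grad = sum(
                    d_j[i] * alpha * pi[i] * f[j]
                    * ((1.0 if i == j else 0.0) - pi[j])
                    for i in range(n)
                )
                z[j] += beta * grad
        theta = theta_new
    return softmax(theta), z


# Arm 0 is truly better, but the shaping reward points at arm 1 (harmful shaping).
true_r = [1.0, 0.0]
harmful_f = [-1.0, 1.0]
pi_naive, z_naive = train_bandit(true_r, harmful_f, learn_weights=False)
pi_adapt, z_adapt = train_bandit(true_r, harmful_f, learn_weights=True)
print("naive shaping, P(arm 0):", pi_naive[0])
print("learned weights, P(arm 0):", pi_adapt[0], "weights:", z_adapt)
```

With fixed weights the policy follows the misleading shaping signal toward arm 1. With learned weights, both weights are pushed below zero in this toy, so the weighted shaping term ends up favoring arm 0. The outcome depends on the upper-level step size: in this setup, a much smaller `beta` lets the policy commit to the wrong arm before the weights have adapted.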

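Experimental Results

Research Questions

RQ1: Can the bi-level optimization framework effectively distinguish beneficial from unbeneficial shaping rewards?
RQ2: How can the gradient of the true reward with respect to the shaping weight parameters be computed and approximated?
RQ3: Do the gradient-based algorithms (EM, MGL, IMGL) enable the policy to exploit shaping rewards while ignoring or transforming harmful ones?
RQ4: Is the proposed approach effective in environments ranging from simple to more complex (CartPole, MuJoCo), and in adaptability tests with harmful or random shaping signals?
RQ5: Do state-action-dependent shaping weights offer an advantage over a single uniform weight?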

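Key Findings

BiPaRS can identify the quality of shaping rewards and adaptively exploit beneficial signals.
The methods can ignore unbeneficial shaping rewards or transform them into beneficial signals.
The BiPaRS variants improve learning performance on CartPole and MuJoCo tasks compared with naive shaping and DPBA.
In the adaptability tests, the approach reduces the influence of harmful shaping rewards and stays close to or above baseline performance.
In mixed-benefit scenarios, state-action-dependent shaping weights can outperform a single uniform weight.
The shaping weights these methods produce reflect local state-action characteristics rather than being globally uniform.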
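The contrast between a single uniform weight and state-action-dependent weights can be illustrated with a small arithmetic check. This is a hypothetical setup, not an experiment from the paper: two states share the same shaping reward, which points at the truly best action in state "A" and at the wrong action in state "B". The names `TRUE_REWARD`, `SHAPING`, and `shaping_is_aligned` are invented for this sketch.

```python
# Hypothetical toy values: two states, two actions each.
TRUE_REWARD = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
SHAPING = {"A": [1.0, -1.0], "B": [1.0, -1.0]}  # helpful in A, misleading in B


def argmax(values):
    # Returns the first index holding the maximum value.
    return max(range(len(values)), key=lambda i: values[i])


def shaping_is_aligned(weight_fn):
    """True if z(s, a) * f(s, a) prefers the truly best action in every state."""
    for s in TRUE_REWARD:
        weighted = [weight_fn(s, a) * SHAPING[s][a] for a in range(2)]
        if argmax(weighted) != argmax(TRUE_REWARD[s]):
            return False
    return True


# A single uniform weight w, scanned over [-2, 2] in steps of 0.1.
uniform_candidates = [w / 10 for w in range(-20, 21)]
uniform_ok = [
    w for w in uniform_candidates
    if shaping_is_aligned(lambda s, a, w=w: w)
]

# Weights that depend on the state (the action argument is unused here).
state_weights = {"A": 1.0, "B": -1.0}
dependent_ok = shaping_is_aligned(lambda s, a: state_weights[s])

print("uniform weights that align in both states:", uniform_ok)
print("state-dependent weights align in both states:", dependent_ok)
```

A positive uniform weight keeps the misleading signal in "B", and a negative one breaks the helpful signal in "A", so no scanned value aligns both states, whereas keeping the sign in "A" and flipping it in "B" does.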

This review was generated by AI and reviewed by a human editor.