QUICK REVIEW

[Paper Review] Inverse Reward Design

Dylan Hadfield-Menell, Smitha Milli | arXiv (Cornell University) | Nov 8, 2017

Advanced Multi-Objective Optimization Algorithms | References: 21 | Citations: 63

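One-Sentence Summary

The paper defines inverse reward design (IRD), the problem of inferring the true objective from the proxy reward a designer provides, and combines it with risk-averse planning to mitigate misspecified rewards and reward hacking.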

ABSTRACT

Autonomous agents optimize the reward function we give them. What they don't know is how hard it is for us to design a reward function that actually captures what we want. When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of terrain) where optimizing that same reward may lead to undesired behavior. Our insight is that reward functions are merely observations about what the designer actually wants, and that they should be interpreted in the context in which they were designed. We introduce inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP. We introduce approximate methods for solving IRD problems, and use their solution to plan risk-averse behavior in test MDPs. Empirical results suggest that this approach can help alleviate negative side effects of misspecified reward functions and mitigate reward hacking.

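Research Motivation and Goals

Motivate and formalize the problem of misspecified reward functions in autonomous agents.
Define the inverse reward design (IRD) problem as inferring the true reward from the proxy reward and the training MDP.
Propose probabilistic (Bayesian) methods for approximating the IRD posterior.
Show how combining IRD with risk-averse planning improves robustness to reward misspecification.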

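Proposed Method

Models the proxy reward as an observation of the designer's true reward, made in the context of the training MDP.
Defines the IRD problem as inferring a distribution over the true reward, P(w* | ~w, ~M), where ~w is the proxy reward and ~M is the training MDP.
Introduces an observation model in which the proxy reward is chosen by an approximately optimal designer, with the agent's behavior under each candidate proxy modeled by a maximum-entropy trajectory distribution.
Develops efficient approximations to the IRD posterior, including a sampling-based method (Sample-Z) and MaxEnt-Z, to handle the intractable normalizing constant.
Relates IRD to Bayesian inverse reinforcement learning and to the pragmatic interpretation of language, which provides a theoretical grounding for the inference approach.
Applies risk-averse planning, using the IRD posterior, when making decisions in test MDPs.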
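
The observation model and posterior described above can be written out as follows. This is a sketch for the linear-reward case, not a quotation from the paper: the symbols \beta (how close to optimal the designer is), \phi (trajectory features), \xi (a trajectory), and \pi (the agent's trajectory distribution) do not appear in this review and are notation assumed here. \tilde{w} and \tilde{M} correspond to ~w and ~M in the text.

```latex
% Observation model: the designer tends to pick a proxy reward whose induced
% behavior in the training MDP scores well under the true reward w^*.
P(\tilde{w} \mid w^*, \tilde{M}) \propto
  \exp\!\Big( \beta \, \mathbb{E}\big[\, w^{*\top} \phi(\xi) \;\big|\; \xi \sim \pi(\xi \mid \tilde{w}, \tilde{M}) \big] \Big)

% Posterior over the true reward. \tilde{\phi}_{\tilde{w}} denotes the expected
% features of the behavior induced by proxy \tilde{w} in the training MDP.
P(w = w^* \mid \tilde{w}, \tilde{M}) \propto
  \frac{\exp\big( \beta \, w^\top \tilde{\phi}_{\tilde{w}} \big)}{\tilde{Z}(w)} \, P(w),
\qquad
\tilde{Z}(w) = \int \exp\big( \beta \, w^\top \tilde{\phi}_{\tilde{w}'} \big) \, d\tilde{w}'
```

The normalizer \tilde{Z}(w) integrates over every possible proxy reward, and evaluating each term requires solving a planning problem. This is the quantity that the Sample-Z and MaxEnt-Z approximations target.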

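Experimental Results

Research Questions

RQ1: Given a proxy reward and the training environment, how can we infer the designer's true objective?
RQ2: Does the IRD posterior help an agent avoid the consequences of a misspecified reward in unseen environments?
RQ3: Can IRD be approximated efficiently even though the likelihood is intractable (a doubly intractable problem)?
RQ4: Does risk-averse planning with the IRD posterior reduce negative side effects and reward hacking?
RQ5: How does IRD relate to, and differ from, standard inverse reinforcement learning?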
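
To make RQ1 and RQ3 concrete, here is a minimal sketch of a sampling-based posterior in the spirit of Sample-Z. It rests on several assumptions that are not from the paper: a finite set of training trajectories, a softmax trajectory distribution standing in for the maximum-entropy planner, a uniform prior over a small candidate set, and hypothetical features (grass, dirt, lava). The function names are illustrative.

```python
import numpy as np

def expected_features(w, traj_features):
    """Expected features under a softmax distribution over trajectories.

    Assumption: pi(xi | w) is proportional to exp(w . phi(xi)) over a finite
    trajectory set, standing in for a max-entropy planner.
    """
    scores = traj_features @ w
    p = np.exp(scores - scores.max())  # subtract max for numerical stability
    p /= p.sum()
    return p @ traj_features

def ird_posterior(proxy_w, candidate_ws, sampled_proxies, traj_features, beta=1.0):
    """Posterior over candidate true rewards with a sampled normalizer.

    Uniform prior over candidate_ws (an assumption). The normalizer Z(w) is
    approximated by a sum over sampled_proxies instead of an integral.
    """
    phi_proxy = expected_features(proxy_w, traj_features)
    phi_samples = np.array(
        [expected_features(w, traj_features) for w in sampled_proxies]
    )
    log_post = np.empty(len(candidate_ws))
    for i, w in enumerate(candidate_ws):
        log_num = beta * (w @ phi_proxy)
        log_z = np.logaddexp.reduce(beta * (phi_samples @ w))
        log_post[i] = log_num - log_z
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Hypothetical training MDP: features are (grass, dirt, lava); lava never appears.
train_trajs = np.array([[3.0, 1.0, 0.0],
                        [1.0, 3.0, 0.0],
                        [2.0, 2.0, 0.0]])
proxy = np.array([-1.0, 1.0, 0.0])       # designer's proxy: prefer dirt over grass
candidates = np.array([[-1.0, 1.0, 0.0],    # lava is neutral
                       [-1.0, 1.0, -10.0],  # lava is very bad
                       [1.0, -1.0, 0.0]])   # prefers grass instead
posterior = ird_posterior(proxy, candidates, candidates, train_trajs)
print(posterior)
```

Because lava never appears in the training trajectories, the first two candidates explain the proxy equally well and receive the same posterior mass. That residual uncertainty about unseen features is what the planner can later hedge against.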

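Key Findings

IRD combined with risk-averse planning reduces negative side effects, such as crossing hazardous terrain that was never seen at design time.
The IRD posterior hedges against reward hacking by accounting for uncertainty about the true objective.
Approximate inference techniques (Sample-Z, MaxEnt-Z) make it practical to estimate the IRD posterior in the domains studied.
In the latent-reward setting, where the relevant features are not directly observed, IRD can still steer the agent away from catastrophic outcomes by treating the proxy reward as a context-dependent observation.
The approach shows robustness in simple domains and offers a path toward handling more complex reward misspecification.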
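
The hedging effect in these findings can be illustrated with a small worst-case planner. This is a sketch under stated assumptions, not the paper's implementation: the test MDP is reduced to two hypothetical candidate trajectories with features (grass, dirt, lava), and the posterior is represented by three hand-picked weight samples that agree on grass and dirt but disagree about lava.

```python
import numpy as np

def risk_averse_choice(weight_samples, traj_features):
    """Pick the trajectory with the best worst-case reward over weight samples."""
    rewards = traj_features @ weight_samples.T  # shape: (n_trajs, n_samples)
    worst_case = rewards.min(axis=1)
    return int(np.argmax(worst_case)), worst_case

# Hypothetical test MDP with two candidate paths; features are (grass, dirt, lava).
test_trajs = np.array([[0.0, 2.0, 1.0],   # index 0: short path that crosses lava
                       [2.0, 3.0, 0.0]])  # index 1: longer path that avoids lava

# Stand-in posterior samples: lava was unseen in training, so its weight is uncertain.
samples = np.array([[-1.0, 1.0, 0.0],
                    [-1.0, 1.0, -10.0],
                    [-1.0, 1.0, 10.0]])

choice, worst_case = risk_averse_choice(samples, test_trajs)

# Optimizing the proxy reward literally treats lava as neutral.
proxy = np.array([-1.0, 1.0, 0.0])
proxy_choice = int(np.argmax(test_trajs @ proxy))

print("risk-averse choice:", choice)    # -> 1 (avoids lava)
print("proxy-optimal choice:", proxy_choice)  # -> 0 (crosses lava)
```

The literal proxy optimizer crosses the lava because the proxy assigns it no cost, while the worst-case planner avoids it because at least one plausible true reward penalizes lava heavily.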

This review was generated by AI and reviewed by a human editor.