QUICK REVIEW

[Paper Review] Learning Robust Rewards with Adversarial Inverse Reinforcement Learning

Justin Fu, Katie Luo|arXiv (Cornell University)|Oct 30, 2017

Receptor Mechanisms and Signaling|Cited by 80

One-Sentence Summary

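AIRL is a scalable IRL method that learns disentangled rewards, which makes policy optimization possible under unknown or changing dynamics. It outperforms existing IRL and GAN-based methods on transfer tasks while remaining on par with them on imitation benchmarks.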

ABSTRACT

Reinforcement learning provides a powerful and general framework for decision making and control, but its application in practice is often hindered by the need for extensive feature and reward engineering. Deep reinforcement learning methods can remove the need for explicit engineering of policy or value features, but still require a manually specified reward function. Inverse reinforcement learning holds the promise of automatic reward acquisition, but has proven exceptionally difficult to apply to large, high-dimensional problems with unknown dynamics. In this work, we propose adversarial inverse reinforcement learning (AIRL), a practical and scalable inverse reinforcement learning algorithm based on an adversarial reward learning formulation. We demonstrate that AIRL is able to recover reward functions that are robust to changes in dynamics, enabling us to learn policies even under significant variation in the environment seen during training. Our experiments show that AIRL greatly outperforms prior methods in these transfer settings.

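Motivation and Objectives

Highlight reward engineering as a bottleneck in reinforcement learning and the resulting need for automatic reward acquisition.
Develop a practical IRL algorithm that produces rewards transferable across different dynamics.
Address reward shaping and reward ambiguity in order to learn disentangled rewards.
Demonstrate scalability to continuous control under unknown dynamics, as well as the transferability of the learned rewards.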

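Proposed Method

Uses an adversarial IRL framework to jointly learn a reward and a value function.
Uses a discriminator defined on single transitions rather than whole trajectories, with a logit f tied to the disentangled reward: f(s,a,s') = g_theta(s,a) + gamma h_phi(s') - h_phi(s).
Restricts the reward component g_theta to depend on the state only, i.e. g_theta(s), so that it is disentangled from the dynamics.
Introduces a shaping term h_phi to absorb unintended reward-shaping effects.
Trains with alternating updates: train the discriminator to tell expert samples from policy samples, then update the reward model and the policy.
Provides theoretical justification that, under certain conditions, the learned g_theta recovers the true reward up to a constant offset.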
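
The discriminator structure listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the linear forms of g_theta and h_phi, the function names, and the value of gamma are all assumptions made for the example. The discriminator form D = exp(f) / (exp(f) + pi(a|s)) and the policy reward log D - log(1 - D) are taken from the AIRL paper as best recalled, and are not spelled out in this summary.

```python
import math

GAMMA = 0.99  # discount factor; the value is an assumption for this example


def g_theta(state, theta):
    # Hypothetical state-only reward term: linear in the state features.
    return sum(w * x for w, x in zip(theta, state))


def h_phi(state, phi):
    # Hypothetical shaping (potential) term: also linear, for brevity.
    return sum(w * x for w, x in zip(phi, state))


def airl_logit(state, next_state, theta, phi, gamma=GAMMA):
    # f(s, a, s') = g_theta(s) + gamma * h_phi(s') - h_phi(s)
    return g_theta(state, theta) + gamma * h_phi(next_state, phi) - h_phi(state, phi)


def discriminator(f_value, log_pi):
    # D = exp(f) / (exp(f) + pi(a|s)), evaluated as sigmoid(f - log pi)
    # so that large |f| does not overflow.
    z = f_value - log_pi
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def policy_reward(f_value, log_pi):
    # Reward handed to the policy optimizer: log D - log(1 - D).
    # Algebraically this equals f - log pi(a|s).
    d = discriminator(f_value, log_pi)
    return math.log(d) - math.log(1.0 - d)


# Example transition with made-up numbers.
f = airl_logit([1.0, 2.0], [0.5, 1.5], theta=[0.1, -0.2], phi=[0.3, 0.4])
log_pi = math.log(0.25)  # log-probability of the taken action under the policy
d = discriminator(f, log_pi)
r = policy_reward(f, log_pi)
```

In a full training loop, theta and phi would be fit by a binary classification loss on expert versus policy transitions, and the policy would then be updated against `policy_reward`; both steps are omitted here.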

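Experimental Results

Research Questions

RQ1: Can AIRL learn disentangled rewards that are robust to changes in environment dynamics?
RQ2: Is AIRL scalable and efficient on high-dimensional continuous control tasks?
RQ3: Compared with prior IRL methods, does recovering a disentangled reward improve transfer to environments with different dynamics?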

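Key Findings

The disentangled rewards learned by AIRL transfer across changes in dynamics and outperform naive IRL methods in transfer scenarios.
In tabular MDPs, a state-only reward reproduces the true reward up to a constant offset, whereas a state-action reward recovers a shaped advantage.
On continuous control transfer tasks, AIRL with a state-only reward transfers successfully under domain shift, whereas direct policy transfer and non-disentangled IRL methods fail.
On standard imitation benchmarks AIRL is comparable to GAIL, but it significantly outperforms GAIL in transfer and generalization settings.
GAN-GCL struggles with trajectory-centric learning on high-dimensional tasks, while AIRL remains scalable and effective.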
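
The shaping ambiguity behind these findings can be made concrete with a small tabular experiment. The sketch below is not from the paper: it builds a random deterministic MDP (sizes, seed, and all names are arbitrary assumptions) and checks the standard potential-based shaping result, namely that adding gamma h(s') - h(s) to the reward shifts Q(s,a) by -h(s) and leaves the greedy policy unchanged under the same dynamics.

```python
import random


def value_iteration(next_state, reward, gamma, iters=2000):
    # Tabular Q-value iteration for a deterministic MDP.
    n_states, n_actions = len(next_state), len(next_state[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        v = [max(row) for row in q]
        q = [[reward[s][a] + gamma * v[next_state[s][a]]
              for a in range(n_actions)] for s in range(n_states)]
    return q


def shape(reward, next_state, potential, gamma):
    # r'(s, a) = r(s, a) + gamma * h(s') - h(s), with s' given by the dynamics.
    return [[reward[s][a] + gamma * potential[next_state[s][a]] - potential[s]
             for a in range(len(reward[s]))] for s in range(len(reward))]


def greedy(q):
    # Index of the best action in every state.
    return [max(range(len(row)), key=row.__getitem__) for row in q]


rng = random.Random(0)  # fixed seed so the example is reproducible
n_states, n_actions, gamma = 5, 3, 0.9
next_state = [[rng.randrange(n_states) for _ in range(n_actions)]
              for _ in range(n_states)]
reward = [[rng.uniform(-1.0, 1.0) for _ in range(n_actions)]
          for _ in range(n_states)]
potential = [rng.uniform(-5.0, 5.0) for _ in range(n_states)]

q_true = value_iteration(next_state, reward, gamma)
q_shaped = value_iteration(next_state, shape(reward, next_state, potential, gamma), gamma)

same_policy = greedy(q_true) == greedy(q_shaped)
max_gap = max(abs(q_shaped[s][a] - (q_true[s][a] - potential[s]))
              for s in range(n_states) for a in range(n_actions))
```

Because the shaped reward bakes in the next state produced by the training dynamics, the equivalence is only guaranteed while those dynamics stay fixed. This is one way to read why the review reports better transfer for state-only rewards than for state-action rewards.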


This review was generated by AI and checked by a human editor.