QUICK REVIEW

[Paper Review] Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping

Yujing Hu, Weixun Wang|arXiv (Cornell University)|Nov 5, 2020

Reinforcement Learning in Robotics|References: 25|Citations: 94

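One-Sentence Summary

This paper proposes BiPaRS, a bi-level optimization framework that adaptively utilizes a given shaping reward function through a learned shaping weight function. It comes with three gradient-based algorithms (EM, MGL, IMGL), and experiments on CartPole and MuJoCo show that it can amplify beneficial shaping rewards and mitigate harmful ones.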

ABSTRACT

Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL). Existing approaches such as potential-based reward shaping normally make full use of a given shaping reward function. However, since the transformation of human knowledge into numeric reward values is often imperfect due to reasons such as human cognitive bias, completely utilizing the shaping reward function may fail to improve the performance of RL algorithms. In this paper, we consider the problem of adaptively utilizing a given shaping reward function. We formulate the utilization of shaping rewards as a bi-level optimization problem, where the lower level is to optimize policy using the shaping rewards and the upper level is to optimize a parameterized shaping weight function for true reward maximization. We formally derive the gradient of the expected true reward with respect to the shaping weight function parameters and accordingly propose three learning algorithms based on different assumptions. Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards, and meanwhile ignore unbeneficial shaping rewards or even transform them into beneficial ones.

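Motivation and Objectives

Motivate reward shaping as a way to inject domain knowledge into reinforcement learning (RL).
Formulate the adaptive utilization of a given shaping reward as a bi-level optimization problem.
Develop gradient-based methods that optimize the shaping weights to maximize the true reward.
Demonstrate that the proposed method can identify beneficial shaping signals and suppress or transform harmful ones.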

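Proposed Method

Express the modified reward as r' = r + z_phi(s,a) f(s,a), where r is the true reward, f is the given shaping reward, and z_phi is the shaping weight function with parameters φ.
Define the bi-level objective: the upper level maximizes the expected true reward J(z_phi) with respect to φ, while the lower level trains the policy to maximize the expected modified reward tilde{J} with respect to θ.
Derive the gradient of J(z_phi) with respect to φ and propose three approximation algorithms: Explicit Mapping (EM), Meta-Gradient Learning (MGL), and Incremental Meta-Gradient Learning (IMGL).
Provide the gradient expressions in Eqs. (4) and (5), together with the update rules detailed in Eqs. (6)–(9).
Discuss explicitly mapping z_phi into an extended state space S_z, and the hyper-policy formulation.
Present complexity considerations and algorithm procedures in the supplementary material.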
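
The bi-level idea above can be illustrated with a minimal sketch. It is not the paper's EM, MGL, or IMGL algorithm: it assumes a hypothetical two-armed bandit, a softmax policy with logits `theta`, exact expectations instead of sampled trajectories, and a single scalar weight `phi` standing in for z_phi. The lower level takes one gradient step on the modified reward r + phi * f; the upper level differentiates the true reward through that one step (a one-step meta-gradient) to update `phi`. All function names, the reward vectors, and the step sizes `alpha` and `beta` are illustrative assumptions.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def expected(pi, values):
    """Expectation of per-arm values under policy pi."""
    return sum(p * v for p, v in zip(pi, values))

def grad_expected(pi, values):
    """Gradient of E_pi[values] w.r.t. softmax logits: pi_k * (values_k - E_pi[values])."""
    mean = expected(pi, values)
    return [p * (v - mean) for p, v in zip(pi, values)]

def bilevel_step(theta, phi, r, f, alpha, beta):
    """One lower-level policy step followed by one upper-level weight step."""
    pi = softmax(theta)
    # Modified reward r' = r + phi * f (scalar phi is an assumed stand-in for z_phi).
    shaped = [ri + phi * fi for ri, fi in zip(r, f)]
    # Lower level: gradient ascent on the expected modified reward.
    new_theta = [t + alpha * g for t, g in zip(theta, grad_expected(pi, shaped))]
    # Sensitivity of the updated logits to phi: alpha * grad_theta E_pi[f].
    dtheta_dphi = [alpha * g for g in grad_expected(pi, f)]
    # Upper level: chain rule through the policy update, evaluated on the TRUE reward.
    dJ_dtheta = grad_expected(softmax(new_theta), r)
    dJ_dphi = sum(a * b for a, b in zip(dJ_dtheta, dtheta_dphi))
    return new_theta, phi + beta * dJ_dphi

# Hypothetical setup: arm 0 is truly better, but the shaping reward favors arm 1.
r = [1.0, 0.0]    # true reward per arm
f = [-1.0, 1.0]   # harmful shaping reward per arm
theta, phi = [0.0, 0.0], 1.0  # start by fully trusting the shaping reward

for _ in range(50):
    theta, phi = bilevel_step(theta, phi, r, f, alpha=0.5, beta=2.0)

print("learned weight:", phi)
print("policy:", softmax(theta))
```

In this toy setting the weight is driven below zero, so the misleading shaping term ends up pushing the policy toward the truly better arm. This mirrors, in a very reduced form, the behavior the review describes as transforming an unbeneficial shaping reward into a beneficial one.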

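Experimental Results

Research Questions

RQ1: Can the bi-level optimization framework effectively distinguish beneficial shaping rewards from unbeneficial ones?
RQ2: How can the gradient of the true reward with respect to the shaping weight parameters be computed or approximated?
RQ3: Can the gradient-based algorithms (EM, MGL, IMGL) exploit beneficial shaping rewards while ignoring or transforming harmful ones?
RQ4: Is the proposed method effective in both simple and more complex environments (CartPole, MuJoCo), and in adaptability tests with harmful or random shaping signals?
RQ5: Is state-action-dependent shaping weighting advantageous over a single uniform weight?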

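Key Findings

BiPaRS can identify the quality of shaping rewards and adaptively exploit beneficial signals.
The method can ignore unbeneficial shaping rewards or transform them into beneficial ones.
The BiPaRS variants improve learning performance over naive shaping and DPBA on CartPole and MuJoCo tasks.
In the adaptability tests, the method suppresses the effect of harmful shaping rewards and stays close to or above baseline performance.
State-action-dependent shaping weights can outperform a single uniform weight when the shaping reward is beneficial in some regions and harmful in others.
The method produces shaping weights that reflect local state-action characteristics rather than being globally uniform.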
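
A small arithmetic sketch may help show why a state-dependent weight can do more than a single uniform weight. Everything here is a hypothetical toy, not data or code from the paper: two states, a given shaping reward `f` that points the right way in `s0` and the wrong way in `s1`, and an assumed "helpful" signal that is known only for the sake of illustration. The `mismatch` function is an illustrative squared-error measure introduced here, not a quantity the paper defines.

```python
# Hypothetical two-state toy: the shaping reward is right in s0 and wrong in s1.
f = {"s0": 1.0, "s1": -1.0}        # given shaping reward per state
helpful = {"s0": 1.0, "s1": 1.0}   # signal that would actually help (assumed known here)

def mismatch(weights):
    """Squared error between the weighted shaping reward and the helpful signal."""
    return sum((weights[s] * f[s] - helpful[s]) ** 2 for s in f)

# Best single uniform weight, searched over a coarse grid in [-2, 2].
candidates = [w / 10 for w in range(-20, 21)]
best_uniform = min(candidates, key=lambda w: mismatch({s: w for s in f}))

# A state-dependent weight can keep the signal in s0 and flip its sign in s1.
per_state = {"s0": 1.0, "s1": -1.0}

print("best uniform weight:", best_uniform)
print("uniform mismatch:", mismatch({s: best_uniform for s in f}))
print("per-state mismatch:", mismatch(per_state))
```

Under these assumptions the best a uniform weight can do is switch the shaping reward off entirely, whereas the per-state weight both keeps the useful part and inverts the misleading part.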

This review was generated by AI and checked by a human editor.