QUICK REVIEW

[Paper Review] Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine|ArXiv.org|May 2, 2018

Reinforcement Learning in Robotics|References: 41|Citations: 374

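One-Sentence Summary

This paper shows how maximum entropy reinforcement learning and control can be cast as probabilistic inference in a graphical model, derives exact inference for deterministic dynamics and variational inference for stochastic dynamics, and connects the framework to deep RL and planning.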

ABSTRACT

The framework of reinforcement learning or optimal control provides a mathematical formalization of intelligent decision making that is powerful and broadly applicable. While the general form of the reinforcement learning problem enables effective reasoning about uncertainty, the connection between reinforcement learning and inference in probabilistic models is not immediately obvious. However, such a connection has considerable value when it comes to algorithm design: formalizing a problem as probabilistic inference in principle allows us to bring to bear a wide array of approximate inference tools, extend the model in flexible and powerful ways, and reason about compositionality and partial observability. In this article, we will discuss how a generalization of the reinforcement learning or optimal control problem, which is sometimes termed maximum entropy reinforcement learning, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and variational inference in the case of stochastic dynamics. We will present a detailed derivation of this framework, overview prior work that has drawn on this and related ideas to propose new reinforcement learning and control algorithms, and describe perspectives on future research.

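Motivation and Objectives

Provide a unified probabilistic graphical model (PGM) formulation of reinforcement learning and control with an entropy term.
Show that optimal trajectories arise from inference in this PGM, and derive the corresponding backward messages and soft value functions.
Contrast deterministic and stochastic dynamics, emphasizing that variational inference is needed to avoid unrealistic risk-seeking behavior.
Clarify the objective: when the entropy term is included, the formulation recovers maximum entropy RL; explain the implications for reward design and policy learning.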

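Proposed Method

Introduce a maximum entropy extension of the RL/control objective using auxiliary optimality variables O_t, with p(O_t=1|s_t,a_t)=exp(r(s_t,a_t)).
Formulate a PGM in which trajectories are weighted by exp(sum_t r(s_t,a_t)), applying exact inference for deterministic dynamics and variational inference for stochastic dynamics.
Derive the backward messages β_t(s_t,a_t) and β_t(s_t), recover the policy p(a_t|s_t,O_{1:T}), and relate the messages to the soft Q/V functions (for deterministic dynamics, Q(s,a)=r(s,a)+V(s')).
Present the log-space backups for Q and V, connect them to the soft Bellman backup in the deterministic case, and discuss the risk-seeking behavior that arises under stochastic dynamics (addressed by the variational correction).
Discuss alternative model formulations (an undirected CRF, a temperature parameter alpha) and discounting, and connect them to standard RL and entropy-regularized RL.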
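
The backward pass described above can be illustrated with a minimal tabular sketch. It assumes a finite horizon, deterministic dynamics, a uniform action prior (whose constant is dropped), and rewards that are at most zero so that exp(r) is a valid probability. The function names `logsumexp` and `soft_backward_pass` and the toy numbers are hypothetical and do not come from the paper. Here Q and V stand for the logarithms of β_t(s_t,a_t) and β_t(s_t).

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def soft_backward_pass(reward, next_state, horizon):
    """Log-space backward messages for a tabular MDP with deterministic dynamics.

    reward[s][a]     : r(s, a), assumed <= 0 so that exp(r) is a probability
    next_state[s][a] : index of the unique successor state s'
    Returns lists (Q, V, policy), each indexed by time step (index 0 is first).
    """
    n_states, n_actions = len(reward), len(reward[0])
    v_next = [0.0] * n_states  # log-message beyond the last step is log(1) = 0
    q_all, v_all, pi_all = [], [], []
    for _ in range(horizon):
        # Q(s, a) = r(s, a) + V(s')  (deterministic successor)
        q = [[reward[s][a] + v_next[next_state[s][a]] for a in range(n_actions)]
             for s in range(n_states)]
        # V(s) = log sum_a exp Q(s, a)  (uniform action prior, constant dropped)
        v = [logsumexp(q[s]) for s in range(n_states)]
        # pi(a | s) = exp(Q(s, a) - V(s)), the ratio of the two messages
        pi = [[math.exp(q[s][a] - v[s]) for a in range(n_actions)]
              for s in range(n_states)]
        q_all.append(q)
        v_all.append(v)
        pi_all.append(pi)
        v_next = v
    # The loop ran from the last step to the first, so restore time order.
    q_all.reverse()
    v_all.reverse()
    pi_all.reverse()
    return q_all, v_all, pi_all


# Hypothetical 2-state, 2-action problem (all numbers are made up).
reward = [[-1.0, -0.5], [-0.1, -2.0]]
next_state = [[0, 1], [1, 0]]
q, v, pi = soft_backward_pass(reward, next_state, horizon=3)
print(pi[0])  # action probabilities at the first time step, one row per state
```

Because V is a log-sum-exp rather than a max, it is always at least as large as the best Q value in each state, and the resulting policy stays stochastic instead of collapsing onto a single action.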

Experimental Results

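Research Questions

RQ1: How can reinforcement learning and optimal control be reformulated as probabilistic inference in a graphical model?
RQ2: How does the entropy-regularized objective behave, and how should it be interpreted, under deterministic versus stochastic dynamics?
RQ3: In the control-as-inference framework, how can the optimal policy be computed from backward messages?
RQ4: How does variational inference correct the risk-seeking problem that stochastic dynamics introduce in the maximum entropy formulation?
RQ5: How do the CRF formulation, temperature, and discounting relate to standard RL and maximum entropy RL?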

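Key Findings

The maximum entropy formulation of RL/control corresponds to exact inference under deterministic dynamics and to variational inference under stochastic dynamics.
The optimal policy can be recovered from the backward messages β_t(s_t,a_t) and β_t(s_t), which lead to the soft Q/V functions.
In log space, the soft Bellman backup makes explicit both how entropy shapes exploration and the risk-seeking effect that appears under stochastic dynamics.
The variational inference approach fixes the dynamics (freezing p(s_{t+1}|s_t,a_t)) and yields a more robust backup that uses the expected value over next states, which suppresses risk-seeking behavior.
Alternative formulations (undirected CRF, temperature parameter) allow interpolation between entropy maximization and the standard RL objective; discounting can be incorporated straightforwardly.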
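
The difference between the two backups in the findings above can be seen on a toy one-step choice. The sketch below is an illustration under stated assumptions, not the paper's implementation: the helper names `optimistic_backup`, `variational_backup`, and `soft_policy`, the temperature convention pi(a) ∝ exp(Q(a)/alpha), and all numbers are hypothetical. The first backup takes a log-expected-exp over next states, which over-weights lucky outcomes. The second takes a plain expectation, as in the variational treatment with the dynamics held fixed.

```python
import math


def optimistic_backup(r, probs, next_values):
    """Q = r + log E[exp V(s')]: the backup from exact inference, which is
    risk-seeking when the dynamics are stochastic."""
    m = max(next_values)
    return r + m + math.log(
        sum(p * math.exp(v - m) for p, v in zip(probs, next_values)))


def variational_backup(r, probs, next_values):
    """Q = r + E[V(s')]: the backup obtained when the dynamics are held fixed."""
    return r + sum(p * v for p, v in zip(probs, next_values))


def soft_policy(q_values, alpha=1.0):
    """pi(a) proportional to exp(Q(a) / alpha); alpha is an assumed temperature,
    and smaller alpha moves the policy toward the greedy choice."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical choice between a safe action (next value -1 for sure) and a
# risky action (next value 0 or -10 with equal probability).
safe = (0.0, [1.0], [-1.0])
risky = (0.0, [0.5, 0.5], [0.0, -10.0])

q_optimistic = [optimistic_backup(*safe), optimistic_backup(*risky)]
q_variational = [variational_backup(*safe), variational_backup(*risky)]

print("optimistic  Q (safe, risky):", q_optimistic)
print("variational Q (safe, risky):", q_variational)
print("variational policy, alpha=1.0:", soft_policy(q_variational, alpha=1.0))
print("variational policy, alpha=0.1:", soft_policy(q_variational, alpha=0.1))
```

In this example the optimistic backup ranks the risky action above the safe one, while the expectation-based backup ranks the safe action first. By Jensen's inequality the log-expected-exp is never smaller than the plain expectation, which is the source of the optimism.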

This review was generated by AI and checked by a human editor.