QUICK REVIEW

[Paper Review] Offline Reinforcement Learning with Implicit Q-Learning

Ilya Kostrikov, Ashvin Nair|arXiv (Cornell University)|Oct 12, 2021

Reinforcement Learning in Robotics | References: 23 | Citations: 129

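One-line summary

Implicit Q-Learning (IQL) avoids evaluating unseen actions during offline training: it uses state-conditional expectiles to approximate the value of the best in-distribution action, which enables multi-step dynamic programming and yields strong performance on the D4RL benchmark.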

ABSTRACT

Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This trade-off is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose an offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function. Then, we extract the policy via advantage-weighted behavioral cloning. We dub our method implicit Q-learning (IQL). IQL demonstrates the state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance fine-tuning using online interaction after offline initialization.
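
The alternation described in the abstract can be written compactly as three objectives. This is a sketch in the notation commonly used for IQL; the symbols ($\theta$, $\hat{\theta}$ for the Q-network and its target copy, $\psi$ for the value network, $\phi$ for the policy, $\beta$ for the inverse temperature, $\mathcal{D}$ for the dataset) are assumptions here, since this review does not spell out the formulas.

```latex
% Asymmetric (expectile) squared loss, with \tau \in (0, 1)
L_2^{\tau}(u) = \left| \tau - \mathbb{1}(u < 0) \right| u^2

% Fit V to an upper expectile of Q over dataset actions
L_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}}
  \left[ L_2^{\tau}\!\left( Q_{\hat{\theta}}(s,a) - V_{\psi}(s) \right) \right]

% Back V up into Q using only dataset transitions
L_Q(\theta) = \mathbb{E}_{(s,a,s') \sim \mathcal{D}}
  \left[ \left( r(s,a) + \gamma V_{\psi}(s') - Q_{\theta}(s,a) \right)^2 \right]

% Extract the policy by advantage-weighted behavioral cloning (maximized over \phi)
J_{\pi}(\phi) = \mathbb{E}_{(s,a) \sim \mathcal{D}}
  \left[ \exp\!\left( \beta \left( Q_{\hat{\theta}}(s,a) - V_{\psi}(s) \right) \right)
  \log \pi_{\phi}(a \mid s) \right]
```

Every expectation is taken over state-action pairs that appear in the dataset, so none of the three objectives evaluates Q at an action the policy proposes. With $\tau = 0.5$ the first loss reduces to ordinary mean squared error; larger $\tau$ penalizes underestimation more heavily and pushes $V$ toward the value of the best in-support action.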

Motivation and objectives

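Motivates offline reinforcement learning for settings where the data comes from a fixed dataset and online exploration is costly or dangerous.
Proposes a method that never queries unseen actions during value learning.
Uses expectile regression to perform policy improvement implicitly, within the support of the dataset's action distribution.
Achieves multi-step dynamic programming without an explicit policy during training, followed by a simple policy extraction step.
Demonstrates strong empirical performance on the D4RL benchmark, and strong performance when fine-tuned online after offline initialization.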

Proposed method

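Defines an asymmetric expectile regression objective whose targets are restricted to actions in the dataset, and uses it to estimate state-action values.
Uses a separate value function V that approximates an expectile of Q over the behavior action distribution, and backs up Q with the target r(s,a)+γV(s′).
Learns Q and V by alternating between the expectile loss and a SARSA-style TD objective, which avoids out-of-distribution actions.
Extracts the policy from Q and V via advantage-weighted behavioral cloning (AWR), again without querying unseen actions.
Uses clipped double Q-learning, with two Q-functions for target estimation, to stabilize the V and policy updates.
Requires only a small change to a standard SARSA-like update and admits an efficient implementation on modern hardware.
Discusses online fine-tuning, in which training continues as online data is collected.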
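
The steps listed above can be sketched as a handful of loss functions. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the `done` mask, the default values of `tau`, `gamma`, and `beta`, and the cap on the exponentiated advantage are all hypothetical choices made for the example, and the network outputs are stood in for by plain arrays.

```python
import numpy as np


def expectile_loss(diff, tau=0.7):
    """Mean of |tau - 1(diff < 0)| * diff**2 over the batch."""
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))


def value_loss(q1, q2, v, tau=0.7):
    """Expectile regression of V(s) toward min(Q1, Q2)(s, a) on dataset actions."""
    q_min = np.minimum(q1, q2)  # clipped double Q estimate
    return expectile_loss(q_min - v, tau)


def q_target(reward, next_v, done, gamma=0.99):
    """SARSA-style target r + gamma * V(s'); no max over actions is taken.

    The `done` mask (1.0 at terminal transitions) is an assumption of this sketch.
    """
    return reward + gamma * (1.0 - done) * next_v


def q_loss(q, target):
    """Mean squared TD error between Q(s, a) and a fixed target."""
    return float(np.mean((q - target) ** 2))


def awr_weights(q1, q2, v, beta=3.0, max_weight=100.0):
    """Per-sample weights exp(beta * advantage) for weighted behavioral cloning.

    Capping the weights at `max_weight` is an assumed numerical safeguard.
    """
    advantage = np.minimum(q1, q2) - v
    return np.minimum(np.exp(beta * advantage), max_weight)


if __name__ == "__main__":
    # Toy batch of three dataset transitions with made-up values.
    q1 = np.array([1.0, 2.0, 0.5])
    q2 = np.array([1.2, 1.8, 0.4])
    v = np.array([1.0, 1.5, 0.6])
    reward = np.array([0.0, 1.0, 0.0])
    next_v = np.array([1.1, 0.0, 0.7])
    done = np.array([0.0, 1.0, 0.0])

    print("V loss:", value_loss(q1, q2, v))
    print("Q loss:", q_loss(q1, q_target(reward, next_v, done)))
    print("AWR weights:", awr_weights(q1, q2, v))
```

In a training loop the weights from `awr_weights` would multiply the log-likelihood of each dataset action under the policy. Every quantity is computed from transitions stored in the dataset, which is the property the method relies on.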

Experimental results

Research questions

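RQ1: Can offline RL achieve policy improvement well beyond the behavior policy without querying out-of-distribution actions?
RQ2: Does expectile-based learning of in-support action values enable effective multi-step dynamic programming in offline RL?
RQ3: How does IQL compare with multi-step and one-step offline RL methods on the D4RL benchmark, especially on the Ant Maze tasks?
RQ4: Is a simple policy extraction method (advantage-weighted regression) sufficient when the value function is learned without out-of-distribution queries?
RQ5: Can the method be fine-tuned effectively online after offline initialization?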

Key findings

Dataset	BC	10% BC	DT	AWAC	One-step RL	TD3+BC	CQL	IQL (proposed)
halfcheetah-medium-v2	42.6	42.5	42.6	43.5	48.4	48.3	44.0	47.4
hopper-medium-v2	52.9	56.9	67.6	57.0	59.6	59.3	58.5	66.3
walker2d-medium-v2	75.3	75.0	74.0	72.4	81.8	83.7	72.5	78.3
halfcheetah-medium-replay-v2	36.6	40.6	36.6	40.5	38.1	44.6	45.5	44.2
hopper-medium-replay-v2	18.1	75.9	82.7	37.2	97.5	60.9	95.0	94.7
walker2d-medium-replay-v2	26.0	62.5	66.6	27.0	49.5	81.8	77.2	73.9
halfcheetah-medium-expert-v2	55.2	92.9	86.8	42.8	93.4	90.7	91.6	86.7
hopper-medium-expert-v2	52.5	110.9	107.6	55.8	103.3	98.0	105.4	91.5
walker2d-medium-expert-v2	107.5	109.0	108.1	74.5	113.0	110.1	108.8	109.6
locomotion-v2 total	466.7	666.2	672.6	450.7	684.6	677.4	698.5	692.4
antmaze-umaze-v0	54.6	62.8	59.2	56.7	64.3	78.6	74.0	87.5
antmaze-umaze-diverse-v0	45.6	50.2	53.0	49.3	60.7	71.4	84.0	62.2
antmaze-medium-play-v0	0.0	5.4	0.0	0.0	0.3	10.6	61.2	71.2
antmaze-medium-diverse-v0	0.0	9.8	0.0	0.7	0.0	3.0	53.7	70.0
antmaze-large-play-v0	0.0	0.0	0.0	0.0	0.0	0.2	15.8	39.6
antmaze-large-diverse-v0	0.0	6.0	0.0	1.0	0.0	0.0	14.9	47.5
antmaze-v0 total	100.2	134.2	112.2	107.7	125.3	163.8	303.6	378.0
total	566.9	800.4	784.8	558.4	809.9	841.2	1002.1	1070.4

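IQL achieves state-of-the-art performance on the Ant Maze tasks, a domain that requires multi-step dynamic programming to stitch together suboptimal trajectories.
On the MuJoCo locomotion tasks, IQL is competitive with the best prior methods (locomotion total 692.4, versus 698.5 for CQL).
IQL is computationally efficient: for example, 1M updates complete in under 20 minutes on a GTX 1080, faster than the reimplemented baselines.
Larger expectile values (τ) matter for tasks that require stitching: on Ant Maze, a higher τ gives a closer approximation to Q-learning.
The offline results are complemented by online fine-tuning: after IQL initialization, online interaction yields final performance that is competitive with or better than AWAC and CQL in the reported settings.
IQL avoids explicitly querying out-of-distribution actions during value learning and recovers an effective policy through simple advantage-weighted behavioral cloning.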
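
To see why a higher τ moves the estimate toward Q-learning, it can help to compute expectiles of a small sample directly. The sketch below is illustrative only: `expectile` is a hypothetical helper, and the sample stands in for made-up Q-values of the dataset actions available at one state. At τ = 0.5 the expectile is the mean (a SARSA-like average over behavior actions), and as τ approaches 1 it approaches the largest value in the sample, which corresponds to a maximum over in-support actions.

```python
import numpy as np


def expectile(samples, tau, iters=200):
    """Return the tau-expectile of a 1-D sample.

    The expectile m minimizes sum_i |tau - 1(x_i < m)| * (x_i - m)**2.
    It is found here by repeated reweighted averaging, which is Newton's
    method on that convex objective.
    """
    x = np.asarray(samples, dtype=float)
    m = float(x.mean())
    for _ in range(iters):
        weights = np.where(x > m, tau, 1.0 - tau)
        m = float(np.sum(weights * x) / np.sum(weights))
    return m


if __name__ == "__main__":
    # Hypothetical Q-values of five dataset actions at a single state.
    q_values = [0.0, 1.0, 2.0, 3.0, 10.0]
    for tau in (0.5, 0.7, 0.9, 0.99):
        print(f"tau={tau}: expectile={expectile(q_values, tau):.3f}")
```

The printed values increase with τ and stay below the sample maximum, so the estimate never exceeds the value of an action that actually appears in the data.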

This review was generated by AI and checked by a human editor.