QUICK REVIEW

[Paper Review] DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction

Aviral Kumar, Abhishek Gupta|arXiv (Cornell University)|Mar 16, 2020

Reinforcement Learning in Robotics | References: 51 | Citations: 36

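One-line summary

This paper identifies a lack of corrective feedback in bootstrapping-based RL methods and proposes DisCor, a distribution-correction re-weighting strategy. DisCor improves stability and performance, particularly in multi-task and noisy-reward settings.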

ABSTRACT

Deep reinforcement learning can learn effective policies for a wide range of tasks, but is notoriously difficult to use due to instability and sensitivity to hyperparameters. The reasons for this remain unclear. When using standard supervised methods (e.g., for bandits), on-policy data collection provides "hard negatives" that correct the model in precisely those states and actions that the policy is likely to visit. We call this phenomenon "corrective feedback." We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from this corrective feedback, and training on the experience collected by the algorithm is not sufficient to correct errors in the Q-function. In fact, Q-learning and related methods can exhibit pathological interactions between the distribution of experience collected by the agent and the policy induced by training on that experience, leading to potential instability, sub-optimal convergence, and poor results when learning from noisy, sparse or delayed rewards. We demonstrate the existence of this problem, both theoretically and empirically. We then show that a specific correction to the data distribution can mitigate this issue. Based on these observations, we propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training, resulting in substantial improvements in a range of challenging RL settings, such as multi-task learning and learning from noisy reward signals. Blog post presenting a summary of this work is available at: https://bair.berkeley.edu/blog/2020/03/16/discor/.

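Motivation and objectives

Investigate why bootstrapped value targets in ADP-based (approximate dynamic programming) RL fail to benefit from corrective feedback.
Show, theoretically and empirically, that the interaction between the data distribution and the value function causes instability and sub-optimal convergence.
Develop a practical data-distribution correction method that restores corrective feedback and stabilizes learning.
Show that DisCor improves performance, particularly in multi-task and noisy-reward settings.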

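Proposed method

Analyze the notion of corrective feedback using a bandit-like intuition and a formal definition.
Derive the optimal data distribution p_k that maximizes corrective feedback under Bellman updates.
Propose tractable surrogates for the Q*-dependent quantities and re-weight replay-buffer samples using importance weights.
Introduce a practical weight function w_k(s,a) proportional to exp(-gamma [P^{pi_{k-1}} Δ_{k-1}](s,a) / tau).
Train a second model Δ_phi that estimates the bootstrapping/backup error Δ_k, for use in weighting and error modeling.
Provide the DisCor algorithm, which combines weighted Bellman backups with the second model Δ_phi on top of standard DQN/SAC frameworks.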
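The weighting step in the list above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the mean-1 normalization of the weights, and the default values of `gamma` and `tau` are all hypothetical. `next_error` stands in for the term [P^{pi_{k-1}} Δ_{k-1}](s,a), which in practice would come from evaluating the error model Δ_phi at the next state and the policy's next action. The recursive target in `error_model_target` follows one reading of how the paper defines the accumulated error and should be checked against the original.

```python
import numpy as np

def discor_weights(next_error, gamma=0.99, tau=1.0):
    """Weights proportional to exp(-gamma * next_error / tau).

    next_error: per-transition estimate of [P^{pi_{k-1}} Delta_{k-1}](s, a).
    Normalizing to mean 1 over the batch is an assumption made here so the
    loss scale stays comparable to an unweighted loss.
    """
    logits = -gamma * np.asarray(next_error, dtype=float) / tau
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w * len(w) / w.sum()

def weighted_bellman_loss(q_pred, q_target, weights):
    """Mean of weights * squared TD error over the batch."""
    td_error = np.asarray(q_pred) - np.asarray(q_target)
    return float(np.mean(weights * td_error ** 2))

def error_model_target(q_pred, q_target, next_error, gamma=0.99):
    """Assumed regression target for Delta_phi:
    |Q_k - target| + gamma * (error estimate at the next state-action)."""
    return np.abs(np.asarray(q_pred) - np.asarray(q_target)) + gamma * np.asarray(next_error)

# Usage: transitions whose bootstrapped targets are estimated to be
# less reliable (larger next_error) receive smaller weights.
next_error = np.array([0.1, 0.5, 3.0])
q_pred = np.array([1.0, 2.0, 0.5])
q_target = np.array([1.2, 1.5, 2.0])
w = discor_weights(next_error, gamma=0.99, tau=1.0)
loss = weighted_bellman_loss(q_pred, q_target, w)
print(w, loss)
```

The temperature `tau` controls how sharply unreliable targets are down-weighted: a large `tau` approaches uniform weighting, which corresponds to the unweighted DQN/SAC backup.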

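Experimental results

Research questions

RQ1: By what mechanism is corrective feedback lost in bootstrapping-based RL methods?
RQ2: How can the data distribution be corrected during training to maximize corrective feedback?
RQ3: Does re-weighting transitions toward the optimal distribution improve stability and performance in practice?
RQ4: How does DisCor perform in difficult settings such as multi-task RL and learning from noisy rewards?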

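Key findings

Corrective feedback can be absent in ADP methods, leading to sub-optimal convergence and instability even with a replay buffer.
The optimal training distribution p_k assigns higher probability to regions with high Bellman error while also accounting for closeness to Q*; it is relaxed via tractable surrogates.
Re-weighting replay-buffer transitions with weights w_k, based on estimated corrective potential, limits error accumulation and stabilizes learning.
DisCor improves performance in difficult settings; in particular, the reported results show a final success rate about 50% higher than SAC on the MT10 multi-task benchmark.
The approach is compatible with standard ADP-based deep RL algorithms such as DQN and SAC, and supports learning from noisy reward signals and in multi-task scenarios.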
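The finding about the optimal training distribution p_k can be written as a formula. The first line below is a sketch of the paper's result as recalled here, not a quotation from this review: the symbols \lambda^* (a positive normalizing multiplier) and \mathcal{B}^* (the Bellman optimality operator) are not defined in this review, so the exact statement should be checked against the paper. The second line restates the practical weight function, in which the Q*-dependent terms are replaced by an estimated error quantity.

```latex
% Assumed form of the relaxed optimal distribution: high Bellman error
% raises the probability, distance from Q^* lowers it.
p_k(s,a) \;\propto\; \exp\!\big(-|Q_k - Q^*|(s,a)\big)\,
    \frac{|Q_k - \mathcal{B}^* Q_{k-1}|(s,a)}{\lambda^*}

% Practical surrogate used to re-weight replay-buffer transitions.
w_k(s,a) \;\propto\; \exp\!\Big(-\frac{\gamma\,[P^{\pi_{k-1}} \Delta_{k-1}](s,a)}{\tau}\Big)
```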


This review was generated by AI and checked by a human editor.