QUICK REVIEW

[Paper Review] Automatic Data Augmentation for Generalization in Deep Reinforcement Learning

Roberta Răileanu, Max A. Goldstein|arXiv (Cornell University)|Jun 23, 2020

Reinforcement Learning in Robotics | References: 67 | Citations: 52

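One-line summary

This paper proposes DrAC, a data-regularized actor-critic framework, together with three strategies (UCB-DrAC, RL2-DrAC, and Meta-DrAC) that automatically select an effective data augmentation for an RL task. It reports state-of-the-art generalization on Procgen and strong results on DeepMind Control tasks with distractors.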

ABSTRACT

Deep reinforcement learning (RL) agents often fail to generalize to unseen scenarios, even when they are trained on many instances of semantically similar environments. Data augmentation has recently been shown to improve the sample efficiency and generalization of RL agents. However, different tasks tend to benefit from different kinds of data augmentation. In this paper, we compare three approaches for automatically finding an appropriate augmentation. These are combined with two novel regularization terms for the policy and value function, required to make the use of data augmentation theoretically sound for certain actor-critic algorithms. We evaluate our methods on the Procgen benchmark which consists of 16 procedurally-generated environments and show that it improves test performance by ~40% relative to standard RL algorithms. Our agent outperforms other baselines specifically designed to improve generalization in RL. In addition, we show that our agent learns policies and representations that are more robust to changes in the environment that do not affect the agent, such as the background. Our implementation is available at https://github.com/rraileanu/auto-drac.

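Research motivation and objectives

Address the generalization gap in deep RL that arises from overfitting to the training environments.
Propose a theoretically sound data augmentation framework for actor-critic methods.
Develop regularization terms that enforce invariance of the policy and value function to state transformations.
Automatically select an effective augmentation via UCB, RL2 meta-learning, or meta-learned CNN weights.
Demonstrate state-of-the-art performance on Procgen and robustness to task-irrelevant environment changes such as the background.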

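Proposed method

Introduces Data-regularized Actor-Critic (DrAC), which adds two regularization terms: one for the policy and one for the value function.
Relies on optimality-invariant state transformations f(s, ν), i.e. transformations under which V(s) = V(f(s, ν)) and π(a|s) = π(a|f(s, ν)) should hold, and imposes this invariance on the agent.
Keeps the standard actor-critic objective (PPO) and subtracts the regularization losses G_π and G_V, weighted by α_r.
Provides three automatic augmentation strategies: UCB-DrAC (bandit-based selection), RL2-DrAC (meta-learned selection), and Meta-DrAC (meta-learned weights of an augmentation CNN).
Treats augmentation selection as a non-stationary bandit or meta-learning problem that is solved concurrently with the agent's updates.
Demonstrates invariance and robustness through cycle-consistency and JSD (Jensen-Shannon divergence) analyses.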
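
The regularized objective described above can be sketched for a single state as follows. This is a minimal illustration, not the authors' implementation: it assumes G_π is the KL divergence between the policy at s and at f(s, ν), and G_V is the squared difference of the two value estimates, neither of which this review spells out. The function names and the value of `alpha_r` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL[p || q] for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def drac_objective(j_ppo, logits_s, logits_aug, v_s, v_aug, alpha_r=0.1):
    """Return (J_DrAC, G_pi, G_V) for one state.

    logits_s, v_s     : policy logits and value for the original state s
    logits_aug, v_aug : the same quantities for the augmented state f(s, nu)
    alpha_r           : regularization weight (illustrative value, an assumption)
    """
    # Assumed form of the policy regularizer: KL between the two action distributions.
    g_pi = kl_divergence(softmax(logits_s), softmax(logits_aug))
    # Assumed form of the value regularizer: squared difference of the value estimates.
    g_v = (v_aug - v_s) ** 2
    # The objective is maximized, so the regularizers are subtracted.
    return j_ppo - alpha_r * (g_pi + g_v), g_pi, g_v
```

When the agent's outputs are identical for s and f(s, ν), both penalties vanish and the objective reduces to the plain PPO objective. In a gradient-based implementation the outputs for the original state would typically be treated as fixed targets.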

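Experimental results

Research questions

RQ1: Can data augmentation be used safely with actor-critic RL algorithms without corrupting the estimate of the objective?
RQ2: Can task-specific augmentations that improve generalization in RL be identified automatically?
RQ3: Does regularizing with respect to state transformations improve stability and performance when observations are augmented?
RQ4: How do the automatic augmentation methods (UCB-DrAC, RL2-DrAC, Meta-DrAC) compare on Procgen and on DM Control with distractors?
RQ5: Do the learned representations become more invariant to irrelevant visual changes such as the background?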

Key findings

Method	Train median	Train mean	Train std	Test median	Test mean	Test std
PPO	100.0	100.0	7.2	100.0	100.0	8.5
Rand-FM	93.4	87.6	8.9	91.6	78.0	9.0
IBAC-SNI	91.9	103.4	8.5	86.2	102.9	8.6
Mixreg	95.8	104.2	3.1	105.9	114.6	3.3
PLR	101.5	106.7	5.6	107.1	128.3	5.8
DrAC (Best) (Ours)	114.0	119.6	9.4	118.5	138.1	10.5
RAD (Best)	103.7	109.1	9.6	114.2	131.3	9.4
UCB-DrAC (Ours)	102.3	118.9	8.8	118.5	139.7	8.4
RL2-DrAC	96.3	95.0	8.8	99.1	105.3	7.1
Meta-DrAC	101.3	100.1	8.5	101.7	101.2	7.3
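
As a quick consistency check, the test means in the table can be turned into relative gains over PPO. The sketch below assumes the scores are normalized so that PPO equals 100, which the PPO row suggests but the review does not state explicitly. The helper name `relative_gain` is hypothetical.

```python
# Test means copied from the table above (assumed PPO-normalized, PPO = 100.0).
test_mean = {
    "PPO": 100.0,
    "PLR": 128.3,
    "RAD (Best)": 131.3,
    "DrAC (Best)": 138.1,
    "UCB-DrAC": 139.7,
}

def relative_gain(method, baseline="PPO"):
    """Percentage improvement of `method` over `baseline` in test mean."""
    base = test_mean[baseline]
    return 100.0 * (test_mean[method] - base) / base

for name in test_mean:
    print(f"{name}: {relative_gain(name):+.1f}% vs PPO")
```

Under that assumption, the gain for UCB-DrAC comes out just under 40%, in line with the "~40%" figure quoted in the abstract.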

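UCB-DrAC achieves state-of-the-art performance on Procgen: its test mean of 139.7 exceeds that of the compared baselines (for example 128.3 for PLR and 131.3 for RAD (Best)) and matches or exceeds DrAC with the best per-task augmentation (138.1).
Regularizing both the policy and the value function matters: DrAC outperforms variants that regularize only one of the two.
Automatic augmentation selection with UCB-DrAC gives robust, stable performance across games and outperforms baselines that use a fixed augmentation.
On the harder DeepMind Control settings with background distractors, UCB-DrAC consistently outperforms PPO and RAD.
Across Procgen, UCB-DrAC shows lower sensitivity to the background (higher cycle-consistency), indicating more invariant representations.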
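
The bandit-style selection behind UCB-DrAC can be illustrated with a small upper-confidence-bound selector. This is a sketch under stated assumptions rather than the paper's code: the class name, the exploration coefficient `c`, the sliding-window size, and the augmentation names are all hypothetical, and the sliding-window average is one simple way to cope with the non-stationarity mentioned in the method description.

```python
import math
from collections import deque

class UCBAugmentationSelector:
    """Pick one augmentation per update with a UCB rule (illustrative sketch)."""

    def __init__(self, augmentations, c=0.1, window=10):
        self.augmentations = list(augmentations)
        self.c = c  # exploration coefficient (assumed value)
        self.counts = {a: 0 for a in self.augmentations}
        # Only the most recent returns are kept, since the agent keeps changing.
        self.returns = {a: deque(maxlen=window) for a in self.augmentations}
        self.t = 0

    def select(self):
        self.t += 1
        # Try every augmentation once before applying the UCB rule.
        for a in self.augmentations:
            if self.counts[a] == 0:
                return a

        def score(a):
            q = sum(self.returns[a]) / len(self.returns[a])
            bonus = self.c * math.sqrt(math.log(self.t) / self.counts[a])
            return q + bonus

        return max(self.augmentations, key=score)

    def update(self, augmentation, mean_return):
        """Record the mean return observed after training with `augmentation`."""
        self.counts[augmentation] += 1
        self.returns[augmentation].append(mean_return)

def simulate(rounds=50):
    """Toy loop with made-up returns; a real agent would supply rollout returns."""
    selector = UCBAugmentationSelector(["crop", "flip", "color_jitter"])
    fake_return = {"crop": 1.0, "flip": 0.0, "color_jitter": 0.0}
    for _ in range(rounds):
        aug = selector.select()
        selector.update(aug, fake_return[aug])
    return selector.counts
```

In the toy loop the selector tries each augmentation once and then concentrates on the one with the highest recent return, while the exploration bonus leaves room to revisit the others if their estimates become competitive.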


This review was generated by AI and checked by a human editor.