QUICK REVIEW

[Paper Review] Automatic Data Augmentation for Generalization in Deep Reinforcement Learning

Roberta Răileanu, Max A. Goldstein|arXiv (Cornell University)|Jun 23, 2020

Reinforcement Learning in Robotics|References: 67|Cited by: 52

ONE-SENTENCE SUMMARY

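This paper proposes DrAC, a data-regularized actor-critic framework, together with three automatic augmentation strategies (UCB-DrAC, RL2-DrAC, Meta-DrAC) that select an effective data augmentation for a given reinforcement learning task, achieving state-of-the-art generalization on Procgen and outperforming PPO and RAD on DeepMind Control with distractors.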

ABSTRACT

Deep reinforcement learning (RL) agents often fail to generalize to unseen scenarios, even when they are trained on many instances of semantically similar environments. Data augmentation has recently been shown to improve the sample efficiency and generalization of RL agents. However, different tasks tend to benefit from different kinds of data augmentation. In this paper, we compare three approaches for automatically finding an appropriate augmentation. These are combined with two novel regularization terms for the policy and value function, required to make the use of data augmentation theoretically sound for certain actor-critic algorithms. We evaluate our methods on the Procgen benchmark which consists of 16 procedurally-generated environments and show that it improves test performance by ~40% relative to standard RL algorithms. Our agent outperforms other baselines specifically designed to improve generalization in RL. In addition, we show that our agent learns policies and representations that are more robust to changes in the environment that do not affect the agent, such as the background. Our implementation is available at https://github.com/rraileanu/auto-drac.

RESEARCH MOTIVATION AND GOALS

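Address the generalization gap in deep reinforcement learning that arises from overfitting to the training environments.
Propose a theoretically sound framework for using data augmentation with actor-critic methods.
Develop regularization terms that enforce invariance of the policy and the value function to state transformations.
Automatically select an effective augmentation via UCB, RL2 meta-learning, or learning the weights of a CNN.
Demonstrate state-of-the-art performance on Procgen and robustness to task-irrelevant changes in the environment.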

PROPOSED METHOD

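Introduces Data-regularized Actor-Critic (DrAC) with two regularization terms: one for the policy and one for the value function.
Uses an optimality-invariant state transformation f(s, ν) to enforce invariance: V(s) = V(f(s, ν)) and π(a|s) = π(a|f(s, ν)).
Keeps the standard actor-critic objective (PPO) and subtracts the regularization losses G_π and G_V from it, weighted by α_r.
Provides three automatic augmentation strategies: UCB-DrAC (bandit-based selection), RL2-DrAC (meta-learned selection), and Meta-DrAC (meta-learned weights of a CNN that acts as the augmentation).
Because the agent is updated while augmentations are being chosen, treats augmentation selection as a non-stationary multi-armed bandit problem or as a meta-learning problem.
Demonstrates invariance and robustness through cycle-consistency and Jensen-Shannon divergence (JSD) analyses.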
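To make the regularized objective concrete, here is a minimal pure-Python sketch for a single state. It assumes a KL divergence for the policy term G_π and a squared difference for the value term G_V; the summary itself does not spell out these forms. The function names, the toy numbers, and the value alpha_r=0.1 are illustrative assumptions, not taken from the authors' code.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete action distributions (q assumed to have full support)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def drac_regularizers(pi_s, pi_aug, v_s, v_aug):
    """Return (G_pi, G_V) for one state s and its augmented version f(s, nu).

    Assumed forms: G_pi = KL(pi(.|s) || pi(.|f(s, nu))),
                   G_V  = (V(s) - V(f(s, nu)))^2.
    """
    g_pi = kl_divergence(pi_s, pi_aug)
    g_v = (v_s - v_aug) ** 2
    return g_pi, g_v

def drac_objective(j_ppo, g_pi, g_v, alpha_r=0.1):
    """Objective to maximize: the PPO objective minus the weighted regularizers."""
    return j_ppo - alpha_r * (g_pi + g_v)

# Toy numbers (hypothetical): policy and value outputs before and after augmentation.
pi_s = [0.7, 0.2, 0.1]      # pi(a | s)
pi_aug = [0.6, 0.3, 0.1]    # pi(a | f(s, nu))
g_pi, g_v = drac_regularizers(pi_s, pi_aug, v_s=1.0, v_aug=1.2)
objective = drac_objective(j_ppo=0.5, g_pi=g_pi, g_v=g_v, alpha_r=0.1)
print(f"G_pi={g_pi:.4f}  G_V={g_v:.4f}  objective={objective:.4f}")
```

Both terms vanish when the policy and value outputs are identical on s and f(s, ν), so the penalty only applies when the agent reacts to the augmentation.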

EXPERIMENTAL RESULTS

Research Questions

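RQ1: Can data augmentation be used safely with actor-critic RL algorithms without corrupting the estimate of the objective?
RQ2: Can we automatically identify task-specific augmentations that help RL agents generalize?
RQ3: Does regularizing the policy and the value function with respect to the transformations improve stability and performance under augmented observations?
RQ4: How do the automatic augmentation methods (UCB-DrAC, RL2-DrAC, Meta-DrAC) compare on Procgen and on DeepMind Control tasks with distractors?
RQ5: Do the learned representations become more invariant to irrelevant visual changes such as the background?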
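RQ2 and RQ4 hinge on how an augmentation is picked during training. As a rough illustration of the bandit-based option (UCB-DrAC), the sketch below implements a generic upper-confidence-bound selector over augmentation names. The class name, the exploration coefficient c, the sliding window over recent returns (one simple way to cope with non-stationarity), the augmentation names, and the returns fed in are all hypothetical; this is not the authors' implementation.

```python
import math
from collections import deque

class UCBAugmentationSelector:
    """Pick an augmentation with a UCB rule over recent average returns."""

    def __init__(self, augmentations, c=0.1, window=10):
        self.augmentations = list(augmentations)
        self.c = c  # exploration coefficient (assumed value)
        self.counts = {name: 0 for name in self.augmentations}
        # Keep only the last `window` returns per augmentation.
        self.recent = {name: deque(maxlen=window) for name in self.augmentations}
        self.total_steps = 0

    def select(self):
        self.total_steps += 1
        # Try each augmentation once before applying the UCB rule.
        for name in self.augmentations:
            if self.counts[name] == 0:
                return name

        def score(name):
            mean_return = sum(self.recent[name]) / len(self.recent[name])
            bonus = self.c * math.sqrt(math.log(self.total_steps) / self.counts[name])
            return mean_return + bonus

        return max(self.augmentations, key=score)

    def update(self, name, mean_return):
        """Record the return observed after training with augmentation `name`."""
        self.counts[name] += 1
        self.recent[name].append(mean_return)

# Usage with made-up returns: "crop" is assumed to help most on this toy task.
selector = UCBAugmentationSelector(["crop", "grayscale", "cutout"])
fake_returns = {"crop": 1.0, "grayscale": 0.2, "cutout": 0.5}
picks = []
for _ in range(50):
    aug = selector.select()
    selector.update(aug, fake_returns[aug])
    picks.append(aug)
print(selector.counts)
```

With these fixed toy returns the selector tries each option once and then settles on the best one; in real training the returns drift as the agent improves, which is why only recent returns are averaged here.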

Key Findings

PPO-normalized return (%) on Procgen:
Method	Train median	Train mean	Train std	Test median	Test mean	Test std
PPO	100.0	100.0	7.2	100.0	100.0	8.5
Rand-FM	93.4	87.6	8.9	91.6	78.0	9.0
IBAC-SNI	91.9	103.4	8.5	86.2	102.9	8.6
Mixreg	95.8	104.2	3.1	105.9	114.6	3.3
PLR	101.5	106.7	5.6	107.1	128.3	5.8
DrAC (Best) (Ours)	114.0	119.6	9.4	118.5	138.1	10.5
RAD (Best)	103.7	109.1	9.6	114.2	131.3	9.4
UCB-DrAC (Ours)	102.3	118.9	8.8	118.5	139.7	8.4
RL2-DrAC	96.3	95.0	8.8	99.1	105.3	7.1
Meta-DrAC	101.3	100.1	8.5	101.7	101.2	7.3
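The scores in the table appear to be normalized so that PPO equals 100, so each method's relative improvement over PPO can be read off directly. The short script below does that arithmetic for the test means copied from the table; the helper name is an illustrative assumption.

```python
# Test means copied from the table above (PPO = 100.0).
test_mean = {
    "PPO": 100.0,
    "Rand-FM": 78.0,
    "IBAC-SNI": 102.9,
    "Mixreg": 114.6,
    "PLR": 128.3,
    "DrAC (Best)": 138.1,
    "RAD (Best)": 131.3,
    "UCB-DrAC": 139.7,
    "RL2-DrAC": 105.3,
    "Meta-DrAC": 101.2,
}

def improvement_over_ppo(score, ppo=100.0):
    """Relative improvement over PPO, in percent."""
    return (score - ppo) / ppo * 100.0

# Rank methods from best to worst test mean.
ranked = sorted(test_mean, key=test_mean.get, reverse=True)
for name in ranked:
    print(f"{name:12s} {improvement_over_ppo(test_mean[name]):+6.1f}%")
```

UCB-DrAC comes out at about +39.7% over PPO, which is consistent with the "~40%" improvement quoted in the abstract.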

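UCB-DrAC achieves state-of-the-art test performance on Procgen (test mean 139.7 vs. 100.0 for PPO), surpassing the baselines and matching or exceeding the best per-task augmentation, DrAC (Best), on test median (118.5 vs. 118.5) and test mean (139.7 vs. 138.1).
Regularizing both the policy and the value function is crucial; DrAC outperforms variants that regularize only one of the two.
Automatic augmentation with UCB-DrAC gives robust, stable performance across games and usually exceeds fixed-augmentation baselines.
On DeepMind Control with distractors, UCB-DrAC consistently outperforms PPO and RAD in the challenging background settings.
Across Procgen, UCB-DrAC is less sensitive to the background (higher cycle-consistency) and learns more invariant representations.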
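The robustness analysis mentions the Jensen-Shannon divergence (JSD). As a minimal sketch of what such a measurement could look like, the code below computes the JSD between the action distribution on an original observation and on the same observation with a different background. The probabilities are made up for illustration, and the exact protocol used in the paper is not described in this summary.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Symmetric divergence in [0, ln 2]; 0 means the two distributions are identical."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical action probabilities from the same policy on two versions of a frame.
policy_original = [0.70, 0.20, 0.10]
policy_new_background = [0.65, 0.25, 0.10]
jsd = jensen_shannon_divergence(policy_original, policy_new_background)
print(f"JSD = {jsd:.5f}")
```

A value close to 0 would indicate that the policy barely changes when only the background changes; values approaching ln 2 would indicate that the two action distributions hardly overlap.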

This review was generated by AI and reviewed by a human editor.