QUICK REVIEW

[Paper Review] Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels

Ilya Kostrikov, Denis Yarats|arXiv (Cornell University)|Apr 28, 2020

Domain Adaptation and Few-Shot Learning | Cited by 171

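ONE-SENTENCE SUMMARY

DrQ (data-regularized Q) is a simple image augmentation framework for pixel observations. It regularizes both the Q-function and the value targets, which enables robust learning directly from pixels and yields state-of-the-art results among model-free reinforcement learning methods on the DeepMind Control Suite and Atari 100k.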

ABSTRACT

We propose a simple data augmentation technique that can be applied to standard model-free reinforcement learning algorithms, enabling robust learning directly from pixels without the need for auxiliary losses or pre-training. The approach leverages input perturbations commonly used in computer vision tasks to regularize the value function. Existing model-free approaches, such as Soft Actor-Critic (SAC), are not able to train deep networks effectively from image pixels. However, the addition of our augmentation method dramatically improves SAC's performance, enabling it to reach state-of-the-art performance on the DeepMind control suite, surpassing model-based (Dreamer, PlaNet, and SLAC) methods and recently proposed contrastive learning (CURL). Our approach can be combined with any model-free reinforcement learning algorithm, requiring only minor modifications. An implementation can be found at https://sites.google.com/view/data-regularized-q.

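MOTIVATION AND OBJECTIVES

Enable sample-efficient reinforcement learning directly from image observations, without auxiliary losses or pre-training.
Show that input perturbations combined with value-function regularization reduce overfitting in off-policy reinforcement learning from pixels.
Demonstrate that DrQ achieves state-of-the-art performance on the DeepMind Control Suite and Atari 100k.
Provide a practical, algorithm-agnostic implementation that can be paired with SAC and DQN.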

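PROPOSED METHOD

Apply image transformations (random shifts) to observations only when they are sampled from the replay buffer.
Introduce optimality-invariant state transformations, which regularize the Q-function by requiring that a transformed state yield the same Q-value as the original state.
Average the target Q-values over multiple augmentations of the next observation to reduce the variance of the estimate.
Average the Q-function itself over multiple augmentations of the current observation to regularize learning, so that both the target Q and the online Q are averaged.
Combine these mechanisms in DrQ and pair them with standard off-policy algorithms (the actor-critic SAC and the value-based DQN) without changing the core algorithm.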
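
The mechanisms listed above can be sketched in a few lines of NumPy. This is a minimal illustration for a DQN-style discrete-action critic, not the authors' implementation: the function names (`random_shift`, `averaged_target`, `averaged_q_loss`), the callables `q_fn` and `q_target_fn`, and the default values for the pad width and the number of augmentations `k` and `m` are all assumptions made for the example.

```python
import numpy as np


def random_shift(obs, pad=4, rng=None):
    """Randomly shift each image by up to `pad` pixels.

    The batch (B, C, H, W) is padded by replicating edge pixels, then each
    image is cropped back to (H, W) at its own random offset.
    """
    rng = np.random.default_rng() if rng is None else rng
    b, _, h, w = obs.shape
    padded = np.pad(obs, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.empty_like(obs)
    for i in range(b):
        dy, dx = rng.integers(0, 2 * pad + 1, size=2)
        out[i] = padded[i, :, dy:dy + h, dx:dx + w]
    return out


def averaged_target(q_target_fn, reward, next_obs, not_done,
                    gamma=0.99, k=2, pad=4, rng=None):
    """TD target averaged over k augmentations of the next observation.

    q_target_fn (hypothetical) maps a (B, C, H, W) batch to (B, A) Q-values.
    """
    rng = np.random.default_rng() if rng is None else rng
    next_values = [
        q_target_fn(random_shift(next_obs, pad, rng)).max(axis=1)
        for _ in range(k)
    ]
    return reward + gamma * not_done * np.mean(next_values, axis=0)


def averaged_q_loss(q_fn, obs, action, target, m=2, pad=4, rng=None):
    """Mean squared TD error averaged over m augmentations of the observation."""
    rng = np.random.default_rng() if rng is None else rng
    rows = np.arange(len(action))
    losses = []
    for _ in range(m):
        q_taken = q_fn(random_shift(obs, pad, rng))[rows, action]
        losses.append(np.mean((q_taken - target) ** 2))
    return float(np.mean(losses))
```

In a training loop, the batch sampled from the replay buffer would pass through `averaged_target` first and the result would then be fed to `averaged_q_loss`; the stored transitions themselves stay unaugmented. For SAC, the `max` over actions would be replaced by the soft value computed with actions sampled from the policy.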

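EXPERIMENTAL RESULTS

Research Questions

RQ1: Can simple image-based augmentation regularize Q-learning well enough to enable effective learning directly from pixels?
RQ2: Do optimality-preserving transformations of states and targets improve the stability and data efficiency of off-policy reinforcement learning from pixels?
RQ3: How does DrQ compare with state-of-the-art model-free and model-based methods on the DeepMind Control Suite and Atari 100k?
RQ4: Is the method robust across tasks and hyperparameter settings while remaining easy to implement?

Main Findings

DrQ with pixel augmentation performs strongly on the DeepMind Control Suite, surpassing several model-based and contrastive methods.
DrQ improves data efficiency and often matches or even exceeds SAC trained on internal states, without auxiliary losses or a world model.
On Atari 100k, DrQ combined with Efficient DQN sets a new best median performance among comparable methods.
The method is simple to implement, adds negligible computational overhead, and is robust across hyperparameter settings.
Applying DrQ to DQN-style agents also yields improvements, showing that the approach applies to both continuous and discrete action spaces.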
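
For readers unfamiliar with how a median score across Atari games is usually computed, the sketch below shows the common human-normalized convention: each game's raw score is rescaled so that a random policy maps to 0 and a human player maps to 1, and the median is taken across games. Whether the finding above uses exactly this metric is an assumption here, and the game names and numbers in the usage comment are hypothetical placeholders, not results from the paper.

```python
from statistics import median


def human_normalized(agent, random, human):
    """Rescale a raw game score: random policy -> 0.0, human player -> 1.0."""
    return (agent - random) / (human - random)


def median_human_normalized(scores):
    """Median normalized score over games.

    scores maps a game name to a tuple (agent, random, human) of raw scores.
    """
    return median(human_normalized(a, r, h) for a, r, h in scores.values())


# Hypothetical placeholder scores, for illustration only.
example_scores = {
    "game_a": (50.0, 0.0, 100.0),
    "game_b": (10.0, 10.0, 110.0),
    "game_c": (300.0, 100.0, 200.0),
}
example_median = median_human_normalized(example_scores)
```

The median is preferred over the mean in this setting because a handful of games with very large normalized scores would otherwise dominate the aggregate.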

This review was generated by AI and reviewed by a human editor.