QUICK REVIEW

[Paper Digest] Experience Replay for Continual Learning

David Rolnick, Arun Ahuja|arXiv (Cornell University)|Nov 28, 2018

Domain Adaptation and Few-Shot Learning | References: 29 | Cited by: 375

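One-Sentence Summary

CLEAR mixes on-policy learning from new data with off-policy replay plus behavioral cloning, substantially reducing catastrophic forgetting in continual reinforcement learning.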

ABSTRACT

Continual learning is the problem of learning new tasks or knowledge while protecting old knowledge and ideally generalizing from old experience to learn new tasks faster. Neural networks trained by stochastic gradient descent often degrade on old tasks when trained successively on new tasks with different data distributions. This phenomenon, referred to as catastrophic forgetting, is considered a major hurdle to learning with non-stationary data or sequences of new tasks, and prevents networks from continually accumulating knowledge and skills. We examine this issue in the context of reinforcement learning, in a setting where an agent is exposed to tasks in a sequence. Unlike most other work, we do not provide an explicit indication to the model of task boundaries, which is the most general circumstance for a learning agent exposed to continuous experience. While various methods to counteract catastrophic forgetting have recently been proposed, we explore a straightforward, general, and seemingly overlooked solution - that of using experience replay buffers for all past events - with a mixture of on- and off-policy learning, leveraging behavioral cloning. We show that this strategy can still learn new tasks quickly yet can substantially reduce catastrophic forgetting in both Atari and DMLab domains, even matching the performance of methods that require task identities. When buffer storage is constrained, we confirm that a simple mechanism for randomly discarding data allows a limited size buffer to perform almost as well as an unbounded one.

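Motivation and Goals

Motivate and address the stability-plasticity trade-off in continual reinforcement learning.
Develop a replay-based framework that reduces catastrophic forgetting across sequential tasks.
Enable learning without assuming explicit task boundaries or task identities.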

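Proposed Method

Train an actor-critic agent on a mixture of new and replayed experience, using V-Trace off-policy correction.
Apply behavioral cloning between the current policy and its past versions to stabilize learning from replay.
Combine on-policy updates for plasticity with off-policy updates for stability; include cloning losses on the replayed data.
Use a distributed, IMPALA-like architecture, with reservoir sampling to manage the replay buffer when memory is limited.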
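
As a rough illustration of two ingredients listed above, the sketch below shows a reservoir-sampled buffer and the behavioral-cloning penalties in plain Python. The names `ReservoirBuffer`, `policy_cloning_loss`, and `value_cloning_loss` are hypothetical. The buffer uses the textbook "Algorithm R", which may differ from the exact discarding scheme the authors used, and the losses assume a KL divergence on the policy and a squared error on the value estimate. This is not the authors' code.

```python
import math
import random


class ReservoirBuffer:
    """Fixed-capacity buffer that keeps a uniform random sample of all items seen."""

    def __init__(self, capacity, seed=None):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []
        self.num_seen = 0
        self._rng = random.Random(seed)

    def add(self, item):
        self.num_seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the item.
            self.items.append(item)
            return
        # Algorithm R: draw a slot in [0, num_seen); the new item survives
        # only if the slot falls inside the buffer (prob. capacity / num_seen).
        j = self._rng.randrange(self.num_seen)
        if j < self.capacity:
            self.items[j] = item

    def sample(self, k):
        # Uniform sampling with replacement; returns [] if the buffer is empty.
        if not self.items:
            return []
        return [self._rng.choice(self.items) for _ in range(k)]


def policy_cloning_loss(mu, pi, eps=1e-12):
    """KL(mu || pi) for one state (assumed form of the policy-cloning penalty).

    mu: action probabilities stored at acting time; pi: current policy.
    """
    return sum(m * math.log((m + eps) / (p + eps)) for m, p in zip(mu, pi) if m > 0)


def value_cloning_loss(v_current, v_replay):
    """Squared error between current and stored value estimates (assumed form)."""
    return (v_current - v_replay) ** 2
```

In this sketch each stored item would need to carry the behavior policy's action probabilities and value estimate from the time it was recorded, so that both penalties can be evaluated when the item is replayed.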

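Experimental Results

Research Questions

RQ1: In continual reinforcement learning where tasks are presented in sequence, can experience replay reduce catastrophic forgetting?
RQ2: Does mixing on-policy learning with off-policy replay (plus behavioral cloning) improve stability without sacrificing plasticity?
RQ3: How does CLEAR compare with task-aware methods (such as EWC and Progress & Compress), and with training on each task separately or on all tasks simultaneously?
RQ4: How do buffer size and the on-policy/off-policy balance affect learning dynamics and forgetting?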

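Key Findings

CLEAR substantially reduces catastrophic forgetting in both cyclic and sequential task settings.
CLEAR's cumulative performance is comparable to that of training on the tasks separately or simultaneously, effectively eliminating forgetting.
Behavioral cloning improves stability, while off-policy replay continues to support learning on past tasks as new tasks are learned.
A 50/50 split between new and replayed data gives a good trade-off between stability and plasticity; 100% replay harms early learning on new tasks.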
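
The new/replay ratio in the last finding can be pictured as a batch-construction step. The following is a minimal sketch under stated assumptions: `mixed_batch` and its parameters are hypothetical names, the replay pool is a plain list standing in for whatever buffer is used, and tagging each item as "new" or "replay" is just one convenient way to let a training loop apply on-policy losses to the former and off-policy plus cloning losses to the latter.

```python
import random


def mixed_batch(new_items, replay_items, batch_size, replay_fraction=0.5, rng=None):
    """Build one batch with a fixed share of replayed experience.

    Returns a list of (source, item) pairs, where source is "new" or "replay".
    Falls back to an all-new batch while the replay pool is still empty.
    """
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be in [0, 1]")
    rng = rng or random.Random()

    # Number of replayed items; zero if nothing has been stored yet.
    n_replay = round(batch_size * replay_fraction) if replay_items else 0
    n_new = batch_size - n_replay
    if n_new > len(new_items):
        raise ValueError("not enough new items to fill the batch")

    # Fresh experience is taken in order; replayed experience is drawn
    # uniformly with replacement from the pool.
    batch = [("new", x) for x in new_items[:n_new]]
    batch += [("replay", rng.choice(replay_items)) for _ in range(n_replay)]
    return batch
```

Setting `replay_fraction=0.5` corresponds to the 50/50 split described above, while `replay_fraction=1.0` yields a batch with no fresh experience, the setting reported to slow early learning on new tasks.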


This digest was generated by AI and reviewed by human editors.