QUICK REVIEW

[Paper Review] Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning

Emilio Parisotto, Jimmy Ba | arXiv (Cornell University) | Nov 19, 2015

Reinforcement Learning in Robotics | References: 18 | Cited by: 207

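One-Sentence Summary

Actor-Mimic is a deep multitask and transfer reinforcement learning method that trains a single policy network to play multiple Atari games at once by using model compression to mimic expert networks. The shared representations learned during multitask pretraining generalize to new tasks and significantly speed up learning in unseen environments.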

ABSTRACT

The ability to act in multiple environments and transfer previous knowledge to new situations can be considered a critical aspect of any intelligent agent. Towards this goal, we define a novel method of multitask and transfer learning that enables an autonomous agent to learn how to behave in multiple tasks simultaneously, and then generalize its knowledge to new domains. This method, termed "Actor-Mimic", exploits the use of deep reinforcement learning and model compression techniques to train a single policy network that learns how to act in a set of distinct tasks by using the guidance of several expert teachers. We then show that the representations learnt by the deep policy network are capable of generalizing to new tasks with no prior expert guidance, speeding up learning in novel environments. Although our method can in general be applied to a wide range of problems, we use Atari games as a testing environment to demonstrate these methods.

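Research Motivation and Objectives

Develop a method that lets a single deep reinforcement learning agent learn multiple tasks simultaneously.
Use shared representations to transfer knowledge from source tasks to new, unseen target tasks.
Use model compression, guided by expert networks, to train a compact multitask policy network.
Show that multitask pretraining significantly speeds up learning on new tasks compared with random initialization.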

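Proposed Method

The method uses imitation learning to train a single deep policy network (the "mimic") to imitate several game-specific expert networks.
Model compression distills the experts' knowledge into a shared, compact policy network.
A feature regression objective provides a richer supervisory signal than action imitation alone, improving representation learning.
The multitask network is fine-tuned on new target tasks to demonstrate the benefit of transfer learning.
The method uses a DQN-style replay buffer and target network to keep training stable.
The method is evaluated on Atari 2600 games in the Arcade Learning Environment (ALE).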
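
The two training signals described above (imitating the experts' actions and regressing onto their features) can be sketched in a few lines. This summary does not state the loss functions, so the sketch assumes a common distillation setup: the expert's Q-values are turned into a softmax policy, the mimic is trained with cross-entropy against that policy, and a squared-error term pulls predicted features toward the expert's features. All function names, the `temperature` parameter, and the `beta` weight are illustrative assumptions, not values taken from the paper.

```python
import math


def log_softmax(values, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in scaled))
    return [v - log_norm for v in scaled]


def policy_regression_loss(expert_q, student_logits, temperature=1.0):
    """Cross-entropy between the expert's softmax policy and the mimic's policy."""
    # Assumption: the expert policy is a softmax over its Q-values.
    target = [math.exp(lp) for lp in log_softmax(expert_q, temperature)]
    student_log_probs = log_softmax(student_logits)
    return -sum(t * lp for t, lp in zip(target, student_log_probs))


def feature_regression_loss(predicted_features, expert_features):
    """Squared L2 distance between predicted and actual expert features."""
    return sum((p - e) ** 2 for p, e in zip(predicted_features, expert_features))


def actor_mimic_loss(expert_q, student_logits, predicted_features,
                     expert_features, beta=0.01, temperature=1.0):
    """Combined objective; beta is a hypothetical weight on the feature term."""
    policy_term = policy_regression_loss(expert_q, student_logits, temperature)
    feature_term = feature_regression_loss(predicted_features, expert_features)
    return policy_term + beta * feature_term


if __name__ == "__main__":
    # Toy values for one state with three actions and two feature dimensions.
    loss = actor_mimic_loss(
        expert_q=[1.0, 2.0, 0.5],
        student_logits=[0.0, 0.0, 0.0],
        predicted_features=[1.0, 2.0],
        expert_features=[1.0, 0.0],
    )
    print(f"combined loss: {loss:.4f}")
```

In a real training loop these terms would be averaged over minibatches sampled from each game's replay buffer and minimized with a gradient-based optimizer; the pure-Python version here only shows how the two signals combine.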

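Experimental Results

Research Questions

RQ1: Can a single deep policy network, trained with expert guidance, perform well across multiple distinct reinforcement learning tasks?
RQ2: Does multitask pretraining with Actor-Mimic significantly speed up learning on new, previously unseen tasks?
RQ3: Do the representations learned during multitask training generalize effectively to new environments?
RQ4: Does adding supervision on intermediate features improve performance over action imitation alone?
RQ5: How does task similarity affect the success of transfer learning in this framework?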

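Key Findings

Using a single shared policy network, the Actor-Mimic network reaches expert-level performance on multiple Atari games.
Compared with random initialization, multitask pretraining with Actor-Mimic significantly accelerates learning on new target tasks.
Using feature regression as a supervisory signal yields better generalization than action imitation alone.
The method generalizes well between tasks with similar mechanics (such as Pong and Breakout) because they share visual and structural features.
When the source and target tasks differ substantially, transfer becomes less effective because of negative transfer.
The method learns multiple tasks simultaneously while keeping model complexity comparable to that of a single-task DQN.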


This review was generated by AI and checked by a human editor.