QUICK REVIEW

[Paper Review] Kickstarting Deep Reinforcement Learning

Simon Schmitt, Jonathan J. Hudson|arXiv (Cornell University)|Mar 10, 2018

Reinforcement Learning in Robotics|References 17|Cited by 44

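One-sentence summary

This paper introduces Kickstarting, a framework in which one or more pretrained teacher agents guide a new student agent during training. It combines the RL objective with a cross-entropy loss toward the teacher policies, and the weight on that loss is gradually decayed so that the student can surpass its teachers.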

ABSTRACT

We present a method for using previously-trained 'teacher' agents to kickstart the training of a new 'student' agent. To this end, we leverage ideas from policy distillation and population based training. Our method places no constraints on the architecture of the teacher or student agents, and it regulates itself to allow the students to surpass their teachers in performance. We show that, on a challenging and computationally-intensive multi-task benchmark (DMLab-30), kickstarted training improves the data efficiency of new agents, making it significantly easier to iterate on their design. We also show that the same kickstarting pipeline can allow a single student agent to leverage multiple 'expert' teachers which specialize on individual tasks. In this setting kickstarting yields surprisingly large gains, with the kickstarted agent matching the performance of an agent trained from scratch in almost 10x fewer steps, and surpassing its final performance by 42 percent. Kickstarting is conceptually simple and can easily be incorporated into reinforcement learning experiments.

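Motivation and goals

Reuse previously trained experts to reduce data requirements and speed up the training of new RL agents.
Develop a flexible kickstarting framework that allows arbitrary teacher and student architectures.
Let the student surpass the teacher's performance by gradually shifting emphasis from teacher guidance to environment reward.
Extend the approach to multiple teachers with task-specific expertise and evaluate it on a multi-task suite.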

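Proposed method

Combine the RL objective with an auxiliary cross-entropy loss between the teacher and student policies.
Introduce a time-varying weight lambda_k that schedules the strength of teacher supervision.
Embed the method in policy-gradient RL (A3C/IMPALA style), using off-policy correction (V-trace) where available.
Optionally optimize lambda_k and other hyperparameters online with Population Based Training (PBT).
In the multi-teacher setting, use task-specific experts and split the distillation weight per teacher to manage the multiple supervision signals.
Cover both single-teacher and multi-teacher scenarios, and compare against training from scratch and against pure distillation.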
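
The combined objective described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the linear decay schedule, and the default values `lambda_0=1.0` and `decay_steps=1000` are all hypothetical, and `rl_loss` stands in for whatever policy-gradient loss (for example an IMPALA-style loss with V-trace) the agent already computes.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (action) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_entropy(teacher_probs, student_logits):
    """Cross-entropy H(pi_teacher || pi_student), averaged over the batch."""
    per_state = -(teacher_probs * log_softmax(student_logits)).sum(axis=-1)
    return float(per_state.mean())


def linear_decay(step, lambda_0=1.0, decay_steps=1000):
    """Hypothetical schedule: lambda_k falls linearly from lambda_0 to 0."""
    return lambda_0 * max(0.0, 1.0 - step / decay_steps)


def kickstarting_loss(rl_loss, student_logits, teacher_probs_list, lambdas):
    """RL loss plus one weighted distillation term per teacher.

    teacher_probs_list holds one array of action probabilities per teacher,
    and lambdas holds the matching weight for each teacher.
    """
    if len(teacher_probs_list) != len(lambdas):
        raise ValueError("need exactly one weight per teacher")
    distill = sum(
        lam * cross_entropy(probs, student_logits)
        for lam, probs in zip(lambdas, teacher_probs_list)
    )
    return rl_loss + distill
```

Once every weight has decayed to zero, the distillation terms vanish and the student optimizes the plain RL loss, which is the mechanism the summary credits for letting students surpass their teachers. The single-teacher case is the same call with one-element lists.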

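Experimental results

Research questions

RQ1: Can Kickstarting with pretrained teachers accelerate deep RL without constraining the architecture?
RQ2: Does letting the student optimize reward under teacher supervision allow it to outperform the teacher?
RQ3: What are the trade-offs between a single teacher and multiple teachers in multi-task RL?
RQ4: How should the influence of teacher guidance (lambda_k) be scheduled to maximize data efficiency and final performance?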

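Key findings

With a single teacher, Kickstarting yields up to a 1.5x speedup on a challenging multi-task benchmark (DMLab-30).
Kickstarted students quickly surpass their teachers.
With multiple task-specific expert teachers in the multi-task setting, the kickstarted agent matches the performance of an agent trained from scratch in about 9.58x fewer steps and exceeds that agent's final performance by 42.2%.
A lambda_k schedule tuned by PBT matches the best hand-designed schedule, reducing the need for manual hyperparameter tuning.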
Relative to larger teachers, Kickstarting gives the better "tuition"; distillation alone falls behind Kickstarting over longer training runs.
Combining multiple experts enables transfer across related tasks (for example, laser tag variants and navigation).


This review was generated by AI and reviewed by human editors.