QUICK REVIEW

[Paper Review] Distral: Robust Multitask Reinforcement Learning

Yee Whye Teh, Victor Bapst|arXiv (Cornell University)|Jul 13, 2017

Reinforcement Learning in Robotics | References: 20 | Cited by: 183

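One-Sentence Summary

Distral introduces a multitask reinforcement learning framework that distills shared behaviour into a central policy and regularizes each task policy toward it, improving stability and transfer across tasks in complex environments.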

ABSTRACT

Most deep reinforcement learning algorithms are data inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multitask learning with shared neural network parameters, where efficiency may be improved through transfer across related tasks. In practice, however, this is not usually observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data efficient. Another issue is the different reward schemes between tasks, which can easily lead to one task dominating the learning of a shared model. We propose a new approach for joint training of multiple tasks, which we refer to as Distral (Distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a "distilled" policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer on complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust and more stable---attributes that are critical in deep reinforcement learning.

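Motivation and Objectives

Improve the data efficiency of deep reinforcement learning through multitask learning while reducing negative gradient interference between tasks.
Propose a distillation-based mechanism that captures common behaviour in a shared policy.
Regularize each task policy toward the distilled policy, and train the distilled policy by distillation from the task policies.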

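Proposed Method

Define a distilled policy π0 that captures behaviour common to all tasks.
Regularize each task policy πi toward π0 with a discounted KL divergence, and add entropy regularization to encourage exploration.
Derive soft Q-learning updates with a softened Bellman backup and a Boltzmann form for the task policies.
Parameterize the distilled and task policies with a two-column architecture, for fast transfer and direct gradient flow.
Explain how the distilled policy is learned as the centroid of the task policies, and how this supports robust multitask learning.
Evaluate several algorithm variants that balance KL and entropy regularization, including alternating versus joint optimization.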
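
The softened Bellman backup and the Boltzmann task policy in the list above can be sketched in a few lines of NumPy. This is a minimal tabular illustration, not the authors' implementation. It assumes the common reparameterization α = c_KL / (c_KL + c_Ent) and β = 1 / (c_KL + c_Ent) of the two regularization coefficients, a discrete action space, and a single sampled next state standing in for the expectation over transitions. All function and variable names are hypothetical.

```python
import numpy as np


def regularization_to_alpha_beta(c_kl, c_ent):
    """Map KL and entropy coefficients to (alpha, beta) (assumed reparameterization)."""
    total = c_kl + c_ent
    return c_kl / total, 1.0 / total


def soft_value(q, log_pi0, alpha, beta):
    """Softened value: V(s) = (1/beta) * log sum_a pi0(a|s)^alpha * exp(beta * Q(s, a))."""
    z = alpha * log_pi0 + beta * q
    z_max = np.max(z, axis=-1, keepdims=True)  # subtract the max for numerical stability
    log_sum = z_max[..., 0] + np.log(np.sum(np.exp(z - z_max), axis=-1))
    return log_sum / beta


def task_policy(q, log_pi0, alpha, beta):
    """Boltzmann task policy: pi_i(a|s) = pi0(a|s)^alpha * exp(beta * (Q(s, a) - V(s)))."""
    v = np.expand_dims(soft_value(q, log_pi0, alpha, beta), axis=-1)
    return np.exp(alpha * log_pi0 + beta * (q - v))


def soft_q_target(reward, next_q, next_log_pi0, alpha, beta, gamma):
    """One-sample softened Bellman backup: r + gamma * V(s')."""
    return reward + gamma * soft_value(next_q, next_log_pi0, alpha, beta)


# Illustrative numbers only: one state, three actions.
alpha, beta = regularization_to_alpha_beta(c_kl=0.1, c_ent=0.1)
log_pi0 = np.log(np.array([0.2, 0.5, 0.3]))  # distilled policy at this state
q = np.array([1.0, 2.0, 0.5])                # task-specific action values
pi_i = task_policy(q, log_pi0, alpha, beta)
target = soft_q_target(0.0, q, log_pi0, alpha, beta, gamma=0.99)
```

Two limiting cases help when reading the sketch. With α = 1 and action values that are equal across actions, the task policy reduces to π0. As β grows, the policy concentrates on the highest-valued actions and the softened value approaches a hard maximum.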

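Experimental Results

Research Questions

RQ1: Compared with a standard multitask A3C baseline, does a distilled shared policy improve data efficiency and stability in multitask reinforcement learning?
RQ2: How does combining KL regularization with entropy regularization affect transfer, exploration, and robustness across tasks?
RQ3: Which architectural choices (one-column vs. two-column parameterization) and optimization schemes best promote transfer and stability?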

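Key Findings

Distral-based methods learn faster in complex 3D environments and reach better final performance than the multitask A3C baseline.
Compared with the one-column variants, the two-column variants combined with distillation transfer faster and perform more robustly.
Entropy regularization helps maintain exploration and prevents premature convergence, which improves robustness across tasks.
Distillation-based sharing yields a centroid-like policy, making learning more stable than parameter sharing alone.
Distral methods are more stable and more robust to hyperparameter settings than the baseline multitask RL methods.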
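
The centroid property noted in the findings above can be checked numerically in a simplified setting. The sketch below assumes a single state, discrete actions, fixed task policies, and equal task weights. Under those assumptions, the distribution π0 that minimizes the summed KL divergence from the task policies is their arithmetic mean. The full method fits π0 with a neural network over visited states, so this is only an illustration of the idea; the names and numbers are hypothetical.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))


def distill_target(task_policies):
    """Minimizer over pi0 of sum_i KL(pi_i || pi0) at one state: the arithmetic mean."""
    return np.mean(task_policies, axis=0)


# Three hypothetical task policies over three actions at the same state.
task_policies = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
pi0 = distill_target(task_policies)


def total_kl(candidate):
    """Summed divergence from every task policy to a candidate shared policy."""
    return sum(kl(p, candidate) for p in task_policies)


# Compare the mean against random candidate distributions on the simplex.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=1000)
smallest_gap = min(total_kl(c) - total_kl(pi0) for c in candidates)
```

Here `smallest_gap` should be non-negative, since no sampled candidate can achieve a lower summed divergence than the mean.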

This review was generated by AI and reviewed by a human editor.