QUICK REVIEW

[Paper Review] Self-Tuning Deep Reinforcement Learning

Tom Zahavy, Zhongwen Xu|arXiv (Cornell University)|Feb 28, 2020

Reinforcement Learning in Robotics | References: 10 | Cited by: 8

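ONE-SENTENCE SUMMARY

This paper presents the Self-Tuning Actor Critic (STAC), a deep reinforcement learning method that uses metagradients and differentiable cross-validation to tune its own hyperparameters during training. On 57 Atari 2600 games over 200 million frames, STAC raises the median human-normalized score of the IMPALA baseline from 243% to 364%, while remaining sample efficient and without a significant increase in compute.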

ABSTRACT

Reinforcement learning (RL) algorithms often require expensive manual or automated hyperparameter searches in order to perform well on a new domain. This need is particularly acute in modern deep RL architectures which often incorporate many modules and multiple loss functions. In this paper, we take a step towards addressing this issue by using metagradients (Xu et al., 2018) to tune these hyperparameters via differentiable cross validation, whilst the agent interacts with and learns from the environment. We present the Self-Tuning Actor Critic (STAC) which uses this process to tune the hyperparameters of the usual loss function of the IMPALA actor critic agent (Espeholt et al., 2018), to learn the hyperparameters that define auxiliary loss functions, and to balance trade-offs in off-policy learning by introducing and adapting the hyperparameters of a novel leaky V-trace operator. The method is simple to use, sample efficient and does not require significant increase in compute. Ablative studies show that the overall performance of STAC improves as we adapt more hyperparameters. When applied to 57 games on the Atari 2600 environment over 200 million frames our algorithm improves the median human normalized score of the baseline from 243% to 364%.
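The human-normalized score quoted in the abstract is conventionally computed per game by rescaling the agent's raw score so that random play maps to 0% and the human reference maps to 100%; the median is then taken over games. The sketch below illustrates that convention only. The function names, game names, and raw scores are hypothetical and are not results from the paper.

```python
from statistics import median


def human_normalized_score(agent, random, human):
    """Rescale a raw score so that random play is 0% and the human reference is 100%."""
    return 100.0 * (agent - random) / (human - random)


def median_hns(scores):
    """Median human-normalized score over a dict of game -> (agent, random, human)."""
    return median(human_normalized_score(*s) for s in scores.values())


# Made-up raw scores for three hypothetical games: (agent, random, human).
example_scores = {
    "game_a": (300.0, 0.0, 100.0),      # 300% of the human reference
    "game_b": (50.0, 10.0, 110.0),      # 40%
    "game_c": (1500.0, 500.0, 1000.0),  # 200%
}

print(median_hns(example_scores))
```

Reporting the median rather than the mean keeps a few games with very large normalized scores from dominating the aggregate.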

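MOTIVATION AND OBJECTIVES

Reduce the need for manual or automated hyperparameter search in complex architectures, especially those that combine multiple loss functions.
Tune hyperparameters automatically during training to improve sample efficiency and performance.
Extend the IMPALA actor-critic framework by learning the hyperparameters of both the main loss function and the auxiliary loss functions.
Introduce a novel leaky V-trace operator and adapt its hyperparameters to balance the trade-offs in off-policy learning.
Show that adapting more hyperparameters yields consistent performance gains across environments.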

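PROPOSED METHOD

Use metagradients to compute the gradient of a validation loss with respect to the hyperparameters, enabling end-to-end hyperparameter optimization.
Apply differentiable cross-validation to evaluate the hyperparameters during training, without requiring additional validation trajectories.
Introduce a leaky V-trace operator whose adaptive hyperparameters control the degree of off-policy correction.
Tune the hyperparameters of the main loss, the auxiliary losses, and the V-trace operator jointly in a single training loop.
Preserve sample efficiency by avoiding extra environment rollouts and significant computational overhead.
Run a meta-optimization loop that updates the hyperparameters according to their performance on the differentiable validation objective.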
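The metagradient idea in the list above can be illustrated on a one-parameter toy problem. The sketch below is not the STAC agent: the quadratic losses, the names `inner_grad`, `outer_loss`, `meta_step`, and `train`, and the sigmoid parameterization of the hyperparameter are all assumptions chosen so the chain rule can be written out by hand. The inner loss depends on a hyperparameter eta, the outer (validation) loss does not, and eta is updated by differentiating the outer loss through one inner gradient step.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def inner_grad(theta, eta, a, b):
    """d/dtheta of the toy inner loss eta*(theta-a)^2 + (1-eta)*(theta-b)^2."""
    return 2.0 * eta * (theta - a) + 2.0 * (1.0 - eta) * (theta - b)


def outer_loss(theta, target):
    """Toy validation loss; it has no hyperparameters of its own."""
    return (theta - target) ** 2


def meta_step(theta, eta_logit, a, b, target, lr=0.1, meta_lr=0.5):
    """One inner update of theta followed by one metagradient update of eta."""
    eta = sigmoid(eta_logit)
    # Inner update: theta' = theta - lr * grad_theta inner_loss(theta, eta).
    theta_new = theta - lr * inner_grad(theta, eta, a, b)
    # Differentiate the inner update with respect to eta.
    dtheta_deta = -lr * 2.0 * ((theta - a) - (theta - b))
    # Chain rule: d outer / d logit = d outer / d theta' * d theta' / d eta * d eta / d logit.
    douter_dtheta = 2.0 * (theta_new - target)
    deta_dlogit = eta * (1.0 - eta)
    meta_grad = douter_dtheta * dtheta_deta * deta_dlogit
    return theta_new, eta_logit - meta_lr * meta_grad, meta_grad


def train(steps, a=0.0, b=1.0, target=0.2, theta=0.5, eta_logit=0.0):
    """Interleave parameter and hyperparameter updates in a single loop."""
    for _ in range(steps):
        theta, eta_logit, _ = meta_step(theta, eta_logit, a, b, target)
    return theta, sigmoid(eta_logit)


theta, eta = train(5000)
print(theta, eta)
```

In this toy setting the inner loss is minimized at eta*a + (1-eta)*b, so the outer loss is smallest when eta is near 0.8; the loop finds that value without any separate hyperparameter search. In a deep RL agent the hand-written derivative would be replaced by automatic differentiation through the update.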

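EXPERIMENTAL RESULTS

Research Questions

RQ1: Can the hyperparameters of a deep reinforcement learning agent be tuned automatically during training with a differentiable method?
RQ2: Does self-tuning multiple hyperparameters in a complex reinforcement learning agent improve sample efficiency and final performance?
RQ3: Can a novel leaky V-trace operator with adaptive hyperparameters improve the stability and performance of off-policy learning?
RQ4: How does the self-tuning agent compare with a fixed-hyperparameter baseline across a range of environments?
RQ5: Does increasing the number of tuned hyperparameters yield a measurable performance gain?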

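Key Findings

STAC raises the median human-normalized score across 57 Atari 2600 games from 243% to 364% within 200 million frames.
The gains are consistent across environments and grow as more hyperparameters are tuned.
The method achieves this without a significant increase in compute or additional environment interactions.
Ablation studies show that adapting more hyperparameters improves overall performance, which supports the scalability of the approach.
Differentiable cross-validation makes hyperparameter updates during training stable and efficient.
The self-tuned leaky V-trace operator balances the trade-offs in off-policy learning and improves sample efficiency.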
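To make the trade-off handled by the leaky V-trace operator concrete, here is a minimal sketch. It assumes the leaky weights are a convex mix of the truncated importance ratio used by V-trace and the untruncated ratio, controlled by `alpha_rho` and `alpha_c`, so that alpha = 1 gives standard V-trace and alpha = 0 gives plain importance sampling. The function name, argument names, and default values are illustrative and not taken from the paper's code.

```python
def leaky_vtrace_targets(rewards, values, bootstrap_value, is_ratios,
                         gamma=0.99, lam=1.0, rho_bar=1.0, c_bar=1.0,
                         alpha_rho=1.0, alpha_c=1.0):
    """Value targets for one trajectory under the assumed leaky V-trace form.

    rewards, values, is_ratios are equal-length lists; is_ratios[t] is
    pi(a_t | x_t) / mu(a_t | x_t) for target policy pi and behaviour policy mu.
    """
    n = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * n
    acc = 0.0  # running value of v_{t+1} - V(x_{t+1})
    for t in reversed(range(n)):
        ratio = is_ratios[t]
        # Leaky weights: mix the truncated and untruncated importance ratios.
        rho = alpha_rho * min(rho_bar, ratio) + (1.0 - alpha_rho) * ratio
        c = lam * (alpha_c * min(c_bar, ratio) + (1.0 - alpha_c) * ratio)
        # Importance-weighted temporal-difference error.
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        # Backward recursion: v_t - V(x_t) = delta_t + gamma * c_t * (v_{t+1} - V(x_{t+1})).
        acc = delta + gamma * c * acc
        targets[t] = values[t] + acc
    return targets


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0]
ratios = [2.0, 2.0, 2.0]  # hypothetical off-policy ratios

# alpha = 1: truncated weights (V-trace); alpha = 0: untruncated importance sampling.
print(leaky_vtrace_targets(rewards, values, 0.0, ratios, gamma=1.0))
print(leaky_vtrace_targets(rewards, values, 0.0, ratios, gamma=1.0,
                           alpha_rho=0.0, alpha_c=0.0))
```

Truncation lowers variance at the cost of biasing the target, and the untruncated ratio does the reverse. Treating the mixing coefficients as differentiable hyperparameters lets a metagradient update choose a point between the two.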


This review was generated by AI and reviewed by human editors.