QUICK REVIEW

[Paper Review] A Self-Tuning Actor-Critic Algorithm

Tom Zahavy, Zhongwen Xu|arXiv (Cornell University)|Feb 28, 2020

Reinforcement Learning in Robotics|References: 29|Cited by: 32

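One-Sentence Summary

STAC and STACX use meta-gradients to self-tune the differentiable hyperparameters of an actor-critic loss and, combined with Leaky V-trace (a leaky variant of V-trace) and auxiliary tasks, deliver consistent performance gains on ALE and DM Control without a significant increase in compute.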

ABSTRACT

Reinforcement learning algorithms are highly sensitive to the choice of hyperparameters, typically requiring significant manual effort to identify hyperparameters that perform well on a new domain. In this paper, we take a step towards addressing this issue by using metagradients to automatically adapt hyperparameters online by meta-gradient descent (Xu et al., 2018). We apply our algorithm, Self-Tuning Actor-Critic (STAC), to self-tune all the differentiable hyperparameters of an actor-critic loss function, to discover auxiliary tasks, and to improve off-policy learning using a novel leaky V-trace operator. STAC is simple to use, sample efficient and does not require a significant increase in compute. Ablative studies show that the overall performance of STAC improved as we adapt more hyperparameters. When applied to the Arcade Learning Environment (Bellemare et al. 2012), STAC improved the median human normalized score in 200M steps from 243% to 364%. When applied to the DM Control suite (Tassa et al., 2018), STAC improved the mean score in 30M steps from 217 to 389 when learning with features, from 108 to 202 when learning from pixels, and from 195 to 295 in the Real-World Reinforcement Learning Challenge (Dulac-Arnold et al., 2020).

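Motivation and Objectives

Reduce the need for manual hyperparameter tuning in deep RL by using meta-gradients to self-tune hyperparameters online.
Develop STAC, which automatically tunes all differentiable hyperparameters of the IMPALA loss and introduces Leaky V-trace.
Extend STAC with auxiliary tasks (STACX) to discover useful auxiliary losses while self-tuning hyperparameters.
Demonstrate empirical gains across diverse domains (ALE and DM Control) through ablation studies and robustness analysis.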

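Proposed Method

Parameterize the inner loss with meta-parameters eta = {gamma, lambda, g_v, g_p, g_e} (the discount factor, the trace parameter, and the value, policy, and entropy loss coefficients); the outer loss includes a KL regularization term to prevent policy drift.
Apply meta-gradient updates: differentiate the outer loss with respect to the meta-parameters and update them online with a meta-optimizer (Adam).
Introduce Leaky V-trace, a differentiable interpolation between plain importance sampling (IS) and truncated IS, controlled by a leak parameter alpha.
For STACX, add auxiliary heads, each with its own meta-parameters, to learn auxiliary tasks that benefit the shared representation; the outer loss focuses on the main head.
Use a shared representation backbone (ResNet-like) with multiple heads; each auxiliary head optimizes its own differentiable loss, using Leaky V-trace for off-policy correction.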
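
The two core mechanisms listed above, the leaky importance weights and the meta-gradient update, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, a single alpha is shared by both weights, the value targets follow the standard V-trace backward recursion, and the meta-gradient example replaces the actor-critic losses with scalar quadratics so that the chain rule can be written by hand.

```python
"""Toy sketches: leaky importance weights and a one-step meta-gradient.

All names, constants, and the quadratic losses are hypothetical
illustrations; none of this is code from the paper.
"""


def leaky_weight(is_ratio, clip, alpha):
    """Interpolate between a truncated and a plain importance weight.

    alpha = 1 gives min(clip, is_ratio), i.e. the truncated IS weight;
    alpha = 0 gives the untruncated importance-sampling ratio.
    """
    return alpha * min(clip, is_ratio) + (1.0 - alpha) * is_ratio


def leaky_vtrace_targets(rewards, values, bootstrap, is_ratios,
                         gamma, lam, alpha, rho_bar=1.0, c_bar=1.0):
    """Value targets from the V-trace backward recursion with leaky weights.

    v_t = V_t + delta_t + gamma * c_t * (v_{t+1} - V_{t+1}), where
    delta_t = rho_t * (r_t + gamma * V_{t+1} - V_t).
    Assumption: one alpha is shared by the rho and c weights.
    """
    next_values = list(values[1:]) + [bootstrap]
    targets = [0.0] * len(rewards)
    correction = 0.0  # holds v_{t+1} - V_{t+1}; zero beyond the last step
    for t in reversed(range(len(rewards))):
        rho = leaky_weight(is_ratios[t], rho_bar, alpha)
        c = lam * leaky_weight(is_ratios[t], c_bar, alpha)
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        correction = delta + gamma * c * correction
        targets[t] = values[t] + correction
    return targets


def meta_gradient(theta, g, a, b, lr):
    """Gradient of a toy outer loss with respect to a meta-parameter g.

    Inner loss: g * (theta - a)**2   (g plays the role of a loss coefficient)
    Outer loss: (theta_new - b)**2   (evaluated after one inner SGD step)
    Returns (d outer / d g, theta_new).
    """
    theta_new = theta - lr * 2.0 * g * (theta - a)   # one inner SGD step
    dtheta_new_dg = -lr * 2.0 * (theta - a)          # how g moved theta
    douter_dtheta_new = 2.0 * (theta_new - b)        # outer-loss gradient
    return douter_dtheta_new * dtheta_new_dg, theta_new


if __name__ == "__main__":
    # Off-policy example: the behaviour policy under-sampled the first action.
    print(leaky_vtrace_targets([1.0, 0.0], [0.2, 0.1], 0.0, [2.0, 0.5],
                               gamma=0.9, lam=1.0, alpha=0.5))
    # One meta-gradient step on g with a hypothetical meta learning rate.
    grad, _ = meta_gradient(theta=0.0, g=1.0, a=1.0, b=1.0, lr=0.1)
    print(1.0 - 0.01 * grad)
```

Because the interpolation is linear in alpha, the value targets are differentiable with respect to alpha, which is what lets a meta-gradient adjust it alongside the other meta-parameters. In a real agent the hand-written chain rule in `meta_gradient` would be replaced by automatic differentiation through the inner update.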

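Experimental Results

Research Questions

RQ1: Can meta-gradients self-tune a large set of differentiable hyperparameters in an online, single-lifetime RL setting?
RQ2: Does self-tuning hyperparameters improve sample efficiency and final performance across domains (ALE and DM Control)?
RQ3: How does Leaky V-trace affect the stability and performance of off-policy actor-critic learning?
RQ4: Do auxiliary tasks (STACX) and their self-tuned meta-parameters further improve representation learning and performance?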

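Key Findings

STACX reaches a median human-normalized score of 364% on Atari at 200M frames (baseline: 243%).
On DM Control, STACX/STAC improve the mean score in the feature, pixel, and RWRL settings (from 217 to 389 with features, from 108 to 202 from pixels, and from 195 to 295 in RWRL).
Ablation studies show that performance improves as more meta-parameters are self-tuned, and STACX consistently outperforms the IMPALA baseline.
STACX is robust to perturbations of its outer hyperparameters, and its meta-parameters follow interpretable trajectories during training.
STACX scales to 21 self-tuned hyperparameters (more than in prior work) without a significant increase in compute.
The auxiliary heads of STACX bring additional gains on pixel-based DM Control, but these gains are not universal in the feature-based setting.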
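
For readers unfamiliar with the Atari metric above: a human-normalized score is conventionally computed per game as 100 × (agent − random) / (human − random), and the median is then taken over games. The sketch below assumes that convention; the function name and the three sets of scores are made-up placeholders, not results from the paper.

```python
from statistics import median


def human_normalized(agent, random_score, human):
    """Per-game score in percent: 0% = random play, 100% = human reference."""
    return 100.0 * (agent - random_score) / (human - random_score)


# Placeholder (agent, random, human) raw scores for three imaginary games.
# These numbers are invented for illustration only.
raw_scores = {
    "game_a": (300.0, 100.0, 200.0),
    "game_b": (50.0, 0.0, 100.0),
    "game_c": (500.0, 100.0, 200.0),
}

per_game = {name: human_normalized(*s) for name, s in raw_scores.items()}
median_hns = median(per_game.values())
print(per_game, median_hns)
```

The median is used rather than the mean so that a few games with very large normalized scores do not dominate the aggregate.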


This review was generated by AI and reviewed by human editors.