QUICK REVIEW

[Paper Review] Metatrace: Online Step-size Tuning by Meta-gradient Descent for Reinforcement Learning Control.

Kenny Young, Baoxiang Wang|arXiv (Cornell University)|May 10, 2018

Reinforcement Learning in Robotics | References: 9 | Cited by: 5

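One-sentence summary

Metatrace tunes step-sizes online for reinforcement learning control by meta-gradient descent, using eligibility traces to keep learning stable in non-stationary settings. Under both linear and nonlinear function approximation, the method speeds up learning and improves robustness to the initial step-size, with the clearest benefit on non-stationary tasks.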

ABSTRACT

Reinforcement learning (RL) has had many successes in both deep and shallow settings. In both cases, significant hyperparameter tuning is often required to achieve good performance. Furthermore, when nonlinear function approximation is used, non-stationarity in the state representation can lead to learning instability. A variety of techniques exist to combat this --- most notably large experience replay buffers or the use of multiple parallel actors. These techniques come at the cost of moving away from the online RL problem as it is traditionally formulated (i.e., a single agent learning online without maintaining a large database of training examples). Meta-learning can potentially help with both these issues by tuning hyperparameters online and allowing the algorithm to more robustly adjust to non-stationarity in a problem. This paper applies meta-gradient descent to derive a set of step-size tuning algorithms specifically for online RL control with eligibility traces. Our novel technique, Metatrace, makes use of an eligibility trace analogous to methods like $TD(\lambda)$. We explore tuning both a single scalar step-size and a separate step-size for each learned parameter. We evaluate Metatrace first for control with linear function approximation in the classic mountain car problem and then in a noisy, non-stationary version. Finally, we apply Metatrace for control with nonlinear function approximation in 5 games in the Arcade Learning Environment where we explore how it impacts learning speed and robustness to initial step-size choice. Results show that the meta-step-size parameter of Metatrace is easy to set, Metatrace can speed learning, and Metatrace can allow an RL algorithm to deal with non-stationarity in the learning task.

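Research motivation and goals

Reduce the sensitivity of online reinforcement learning to hyperparameters, in particular the choice of step-size.
Improve learning stability when the state representation changes during training, which makes the learning problem non-stationary.
Adapt step-sizes online without relying on large experience replay buffers or multiple parallel actors.
Propose a meta-learning method that adjusts step-sizes during training using eligibility traces.
Evaluate the gains in robustness and efficiency under both linear and nonlinear function approximation.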

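Proposed method

Learns step-sizes by meta-gradient descent, differentiating through the learning dynamics of the underlying RL algorithm.
Introduces a meta-objective, an error measured along the trajectory, and assigns credit for it over time with eligibility traces.
Derives update rules for both a single global step-size and a separate step-size per parameter from the gradient of the meta-objective.
Uses an eligibility trace analogous to TD(λ) to track credit over time, so the meta-gradient can be computed efficiently online.
Maintains a separate meta-optimizer that updates the step-sizes from observed learning progress and prediction error.
Integrates the step-size tuning mechanism into standard value-based RL algorithms (e.g., Sarsa and Q-learning).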
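To make the idea of meta-gradient step-size tuning with eligibility traces concrete, here is a minimal sketch in Python. It is not the update rule derived in the paper: it is an IDBD-style scalar variant applied to TD(λ) prediction rather than control. The 5-state random-walk task, the one-hot features, the parameterization α = exp(β), the sensitivity vector `h` (an approximation of ∂w/∂β), the clipping bounds, and every name and constant below are assumptions chosen only for illustration.

```python
import math
import numpy as np


def td_lambda_with_meta_step_size(num_episodes=2000, n_states=5, gamma=1.0,
                                  lam=0.8, alpha_init=0.01,
                                  meta_step_size=0.01, alpha_min=1e-4,
                                  alpha_max=0.1, seed=0):
    """TD(lambda) on a random walk with a scalar step-size tuned online.

    Illustrative assumption, not the paper's algorithm: the step-size is
    alpha = exp(beta), and beta follows a semi-gradient meta-gradient of
    the squared TD error, using h as an estimate of d(w)/d(beta).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(n_states)   # value weights (one-hot features, so w[s] = V(s))
    h = np.zeros(n_states)   # sensitivity of w to beta
    beta = math.log(alpha_init)
    beta_lo, beta_hi = math.log(alpha_min), math.log(alpha_max)

    for _ in range(num_episodes):
        s = n_states // 2            # every episode starts in the middle
        z = np.zeros(n_states)       # accumulating eligibility trace
        done = False
        while not done:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            done = s_next < 0 or s_next >= n_states
            reward = 1.0 if s_next >= n_states else 0.0
            v_next = 0.0 if done else w[s_next]
            delta = reward + gamma * v_next - w[s]   # TD error

            # Meta step: -d(0.5 * delta^2)/d(beta) ~= delta * (x . h) = delta * h[s].
            beta += meta_step_size * delta * h[s]
            beta = min(max(beta, beta_lo), beta_hi)  # safety clip (assumption)
            alpha = math.exp(beta)

            # Eligibility trace update, as in TD(lambda).
            z *= gamma * lam
            z[s] += 1.0

            # Differentiate w <- w + alpha * delta * z with respect to beta,
            # using d(alpha)/d(beta) = alpha and d(delta)/d(beta) ~= -h[s].
            h_s = h[s]
            h += alpha * z * (delta - h_s)

            # Ordinary TD(lambda) weight update with the tuned step-size.
            w += alpha * delta * z
            s = s_next

    return w, math.exp(beta)


if __name__ == "__main__":
    weights, alpha = td_lambda_with_meta_step_size()
    true_values = np.arange(1, 6) / 6.0   # exact values for this random walk
    print("learned values:", np.round(weights, 3))
    print("true values:   ", np.round(true_values, 3))
    print("final step-size:", alpha)
```

In this sketch, β rises while successive updates keep pointing in the same direction and falls once updates start to overshoot, which is the intuition behind tuning a step-size by meta-gradient descent. A per-parameter variant would keep one β per weight instead of a single scalar.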

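Experimental results

Research questions

RQ1: Can meta-gradient descent be applied effectively to step-size tuning for online reinforcement learning with eligibility traces?
RQ2: Compared with a fixed step-size, how does Metatrace improve learning speed and stability in non-stationary environments?
RQ3: Under nonlinear function approximation, how much does Metatrace reduce sensitivity to the initial step-size?
RQ4: Does per-parameter step-size adaptation learned by meta-gradient descent yield faster convergence and better performance on complex control tasks?
RQ5: How does Metatrace perform when the state distribution drifts, especially without a conventional experience replay buffer?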

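Key findings

Compared with a fixed step-size, Metatrace substantially speeds up learning on the classic mountain car problem.
On the noisy, non-stationary mountain car variant, Metatrace kept learning stable, whereas fixed step-sizes failed or diverged.
In the Arcade Learning Environment, Metatrace markedly reduced sensitivity to the initial step-size and performed reliably across a wide range of settings.
The meta-step-size parameter of Metatrace is easy to set and remains effective across tasks without extensive tuning.
Under nonlinear function approximation, Metatrace learned robustly and improved convergence speed and stability in the 5 Atari games tested.
Per-parameter step-size adaptation gave faster policy convergence and better final performance than tuning a single scalar step-size.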


This review was generated by AI and reviewed by human editors.