[Paper Summary] Stabilizing Transformers for Reinforcement Learning
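This paper proposes the Gated Transformer-XL (GTrXL), a transformer architecture with gating layers and reordered layer normalization. It stabilizes training and improves performance in memory-based reinforcement learning, outperforming LSTMs and an external memory architecture on DMLab-30 and on tasks with a scalable memory horizon.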
Owing to their ability to both effectively integrate information over long time horizons and scale to massive amounts of data, self-attention architectures have recently shown breakthrough success in natural language processing (NLP), achieving state-of-the-art results in domains such as language modeling and machine translation. Harnessing the transformer's ability to process long time horizons of information could provide a similar performance boost in partially observable reinforcement learning (RL) domains, but the large-scale transformers used in NLP have yet to be successfully applied to the RL setting. In this work we demonstrate that the standard transformer architecture is difficult to optimize, which was previously observed in the supervised learning setting but becomes especially pronounced with RL objectives. We propose architectural modifications that substantially improve the stability and learning speed of the original Transformer and XL variant. The proposed architecture, the Gated Transformer-XL (GTrXL), surpasses LSTMs on challenging memory environments and achieves state-of-the-art results on the multi-task DMLab-30 benchmark suite, exceeding the performance of an external memory architecture. We show that the GTrXL, trained using the same losses, has stability and performance that consistently matches or exceeds a competitive LSTM baseline, including on more reactive tasks where memory is less critical. GTrXL offers an easy-to-train, simple-to-implement but substantially more expressive architectural alternative to the standard multi-layer LSTM ubiquitously used for RL agents in partially observable environments.
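Research Motivation and Goals
- Motivate the use of transformers in partially observable RL problems with long time horizons.
- Identify the training instability of canonical transformers in the RL setting.
- Propose architectural modifications (identity map reordering and gating) that stabilize learning.
- Show that GTrXL outperforms LSTMs and an external memory architecture on memory benchmarks.
- Demonstrate robustness to seeds and hyperparameters while maintaining competitive performance.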
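Proposed Method
- Adapt the Transformer-XL architecture, with its relative position encodings, as the memory of an RL agent.
- Apply identity map reordering (TrXL-I) by placing LayerNorm only on the input stream of each submodule.
- Introduce gating layers that replace the residual connections after the MHA and MLP submodules (GTrXL).
- Explore several gating variants as ablations (Input, Output, Highway, SigTanh), with GRU-type gating (GTrXL GRU) as the strongest variant.
- Initialize the gating bias so that each gate starts close to an identity map, which helps the agent learn a Markovian policy first.
- Train with V-MPO (an on-policy variant of MPO) to evaluate learning stability and performance across domains.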
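The reordering and gating steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the names `GRUGate` and `gated_sublayer`, the weight initialization, the default bias value of 2.0, and the ReLU applied to the submodule output are assumptions made for illustration, and the LayerNorm here omits its learnable scale and shift.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def layer_norm(x, eps=1e-5):
    # Normalize over the feature (last) axis; learnable scale/shift omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class GRUGate:
    """GRU-type gate g(x, y) used in place of the residual sum x + y.

    x is the skip-connection input, y is the submodule output.
    """

    def __init__(self, dim, bias_init=2.0, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        scale = 1.0 / np.sqrt(dim)  # assumed initialization scale
        # W_* act on the submodule output y, U_* act on the skip input x.
        self.W_r, self.U_r, self.W_z, self.U_z, self.W_g, self.U_g = (
            rng.normal(0.0, scale, size=(dim, dim)) for _ in range(6)
        )
        # A positive bias pushes the update gate z toward 0, so the layer
        # starts close to the identity map g(x, y) ~ x. The value is assumed.
        self.b_g = np.full(dim, bias_init)

    def __call__(self, x, y):
        r = sigmoid(y @ self.W_r + x @ self.U_r)             # reset gate
        z = sigmoid(y @ self.W_z + x @ self.U_z - self.b_g)  # update gate
        h = np.tanh(y @ self.W_g + (r * x) @ self.U_g)       # candidate
        return (1.0 - z) * x + z * h


def gated_sublayer(x, submodule, gate):
    """Identity map reordering plus gating for one submodule (MHA or MLP).

    LayerNorm is applied only to the submodule's input, so x itself reaches
    the gate unnormalized. The ReLU on the submodule output is an assumption.
    """
    y = np.maximum(submodule(layer_norm(x)), 0.0)
    return gate(x, y)
```

Because the bias is subtracted inside the update gate, a larger `bias_init` keeps `z` closer to zero and the sublayer output closer to its input `x`. This is the sense in which the initialization encourages a near-identity map at the start of training. In a full layer, `gated_sublayer` would be applied twice: once with the attention submodule and once with the position-wise MLP, each with its own gate.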
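Experimental Results
Research Questions
- RQ1: Can transformers be trained stably as the memory architecture of an RL agent?
- RQ2: Which architectural changes (layer normalization placement and gating) improve the stability of transformers in RL training?
- RQ3: How does GTrXL perform relative to LSTMs and external memory architectures on RL benchmarks that require memory?
- RQ4: How robust is GTrXL to hyperparameters, seeds, and varying memory horizons?
- RQ5: Does GTrXL scale with the memory horizon and continue to outperform conventional architectures on complex tasks?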
Key Findings
- GTrXL with GRU gating substantially outperforms a competitive 3-layer LSTM baseline on the memory-based environments of DMLab-30.
- GTrXL (GRU) achieves state-of-the-art results on the multi-task DMLab-30 benchmark, surpassing the external memory architecture MERLIN in final performance.
- GTrXL scales better with the memory horizon than LSTMs on the Numpad task, maintaining superior performance as memory demands grow.
- Among the gating variants tested, GRU-type gating gives the best stability and learning speed and the strongest results across tasks.
- GTrXL remains competitive or superior on reactive tasks where memory is less critical, indicating broad applicability as an RL memory replacement for LSTMs.
This summary was generated by AI and reviewed by a human editor.