QUICK REVIEW

[Paper Digest] Reinforcement Learning as One Big Sequence Modeling Problem

Michael Janner, Qiyang Li | arXiv (Cornell University) | Jun 3, 2021

Reinforcement Learning in Robotics | Cited by 17

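ONE-SENTENCE SUMMARY

The paper recasts reinforcement learning as a single sequence modeling problem, using a Transformer to model distributions over sequences of states, actions, and rewards. Treating RL as autoregressive sequence prediction removes the need for separate behavior policy constraints and for ensembles or other epistemic uncertainty estimators, and the authors demonstrate the approach on long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL.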

ABSTRACT

Reinforcement learning (RL) is typically concerned with estimating single-step policies or single-step models, leveraging the Markov property to factorize the problem in time. However, we can also view RL as a sequence problem, with the goal being to predict a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether powerful, high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide simple and effective solutions to the RL problem. To this end, we explore how RL can be reframed as one big sequence modeling problem, using state-of-the-art Transformer architectures to model distributions over sequences of states, actions, and rewards. Addressing RL as a sequence problem significantly simplifies a range of design decisions: we no longer require separate behavior policy constraints, as is common in prior work on offline model-free RL, and we no longer require ensembles or other epistemic uncertainty estimators, as is common in prior work on model-based RL. All of these roles are filled by the same Transformer sequence model. In our experiments, we demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL.
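
The sequence view described in the abstract can be written as a standard autoregressive factorization. This is a sketch under assumptions: the symbols \tau, x_i, L, and \theta are notation introduced here for illustration, and the abstract does not specify how states, actions, and rewards are split into sequence elements.

```latex
\tau = (s_1, a_1, r_1, \ldots, s_T, a_T, r_T) = (x_1, \ldots, x_L),
\qquad
p_\theta(\tau) = \prod_{i=1}^{L} p_\theta(x_i \mid x_1, \ldots, x_{i-1})
```

Under this factorization, maximum-likelihood training on logged trajectories maximizes \sum_{i=1}^{L} \log p_\theta(x_i \mid x_1, \ldots, x_{i-1}), the same supervised objective used to train language models.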

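MOTIVATION AND OBJECTIVES

Investigate whether reinforcement learning can be unified under a single sequence modeling framework.
Remove the need for separate behavior policy constraints in offline RL.
Replace the ensembles and uncertainty estimators used in model-based RL with a single sequence model.
Evaluate how well Transformers model long-horizon decision sequences across a range of RL settings.
Show that a single high-capacity sequence model can handle several RL tasks without task-specific architectural changes.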

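PROPOSED METHOD

Reformulate RL as autoregressive sequence modeling: each trajectory is treated as one sequence of states, actions, and rewards, and the model predicts every element from the elements that precede it.
Use a Transformer architecture to model the joint distribution over sequences of states, actions, and rewards.
Train end to end with supervised learning on demonstrations or previously collected trajectories, using the next element of the sequence as the prediction target.
Rely on attention to capture long-range dependencies across time steps, without recurrence.
At inference time, decode autoregressively, generating the sequence step by step conditioned on the preceding states and actions.
Rely on the sequence model's fit to the training trajectories in place of a separate behavior policy constraint.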
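
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the uniform binning in `discretize`, the interleaving order in `flatten_trajectory`, the greedy rule in `decode_greedy`, and the stand-in model `toy_logits` are all hypothetical choices made here. A trained Transformer would take the place of `toy_logits`.

```python
import numpy as np


def discretize(x, low, high, n_bins):
    """Map real values in [low, high] to integer bin indices in [0, n_bins - 1]."""
    x = np.clip(np.asarray(x, dtype=float), low, high)
    idx = np.floor((x - low) / (high - low) * n_bins).astype(int)
    return np.minimum(idx, n_bins - 1)  # the value `high` falls in the last bin


def flatten_trajectory(states, actions, rewards, low=-1.0, high=1.0, n_bins=10):
    """Interleave per-step states, actions, and rewards into one token list."""
    tokens = []
    for s, a, r in zip(states, actions, rewards):
        tokens.extend(discretize(s, low, high, n_bins).tolist())
        tokens.extend(discretize(a, low, high, n_bins).tolist())
        tokens.append(int(discretize(r, low, high, n_bins)))
    return tokens


def decode_greedy(next_token_logits, prefix, n_new_tokens):
    """Autoregressive decoding: append the highest-scoring next token each step."""
    seq = list(prefix)
    for _ in range(n_new_tokens):
        logits = next_token_logits(seq)  # scores over the vocabulary given the history
        seq.append(int(np.argmax(logits)))
    return seq


def toy_logits(seq, n_bins=10):
    """Hypothetical stand-in for a trained model: favors (last token + 1) mod n_bins."""
    logits = np.zeros(n_bins)
    logits[(seq[-1] + 1) % n_bins] = 1.0
    return logits


# One time step with a 2-D state, a 1-D action, and a scalar reward (made-up values).
tokens = flatten_trajectory(states=[[0.0, -1.0]], actions=[[1.0]], rewards=[0.5])
rollout = decode_greedy(toy_logits, tokens, n_new_tokens=4)
print(tokens)   # [5, 0, 9, 7]
print(rollout)  # [5, 0, 9, 7, 8, 9, 0, 1]
```

Greedy argmax is the simplest decoding rule; sampling or beam search fit the same loop by changing only how the next token is chosen from `logits`.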

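EXPERIMENTAL RESULTS

Research Questions

RQ1: Can reinforcement learning be effectively unified under a single sequence modeling paradigm?
RQ2: Can a single Transformer model take over the roles of several separate components, such as behavior policy constraints and uncertainty estimators?
RQ3: How well does the approach generalize across long-horizon, goal-conditioned, and offline RL tasks?
RQ4: Compared with conventional RL methods, does autoregressive sequence modeling simplify design or improve performance?
RQ5: Can a high-capacity sequence model learn complex policies without explicit reward shaping or auxiliary objectives?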

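Key Findings

The method achieves competitive performance on long-horizon control tasks without explicit reward shaping or curriculum learning.
It removes the need for separate behavior policy constraints in offline RL, which simplifies training and inference.
In goal-conditioned RL, the model generalizes to unseen goals without fine-tuning or auxiliary networks.
In imitation learning, the method matches or exceeds dedicated behavior cloning baselines.
A single unified Transformer sequence model achieves strong results across several RL benchmarks.
Replacing ensembles and uncertainty estimators with one high-capacity sequence model substantially reduces architectural complexity.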

This digest was generated by AI and reviewed by a human editor.