QUICK REVIEW

[Paper Review] Offline Reinforcement Learning as One Big Sequence Modeling Problem

Michael Janner, Qiyang Li | arXiv (Cornell University) | Jun 3, 2021

Reinforcement Learning in Robotics | Cited by 41

One-Sentence Summary

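This paper treats trajectories as unified sequences and uses a Transformer with beam search (the Trajectory Transformer) for imitation learning, goal-conditioned reinforcement learning, and offline reinforcement learning, achieving competitive or state-of-the-art results without relying on many of the components of conventional RL.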

ABSTRACT

Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize problems in time. However, we can also view RL as a generic sequence modeling problem, with the goal being to produce a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide effective solutions to the RL problem. To this end, we explore how RL can be tackled with the tools of sequence modeling, using a Transformer architecture to model distributions over trajectories and repurposing beam search as a planning algorithm. Framing RL as a sequence modeling problem simplifies a range of design decisions, allowing us to dispense with many of the components common in offline RL algorithms. We demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL. Further, we show that this approach can be combined with existing model-free algorithms to yield a state-of-the-art planner in sparse-reward, long-horizon tasks.

Research Motivation and Goals

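Reframe reinforcement learning as a single sequence modeling problem to simplify design decisions and take advantage of high-capacity sequence models.
Demonstrate the accuracy of long-horizon trajectory prediction with a Transformer architecture.
Show that beam-search planning with the Trajectory Transformer is competitive in offline RL and also enables imitation learning and goal-conditioned RL.
Investigate how variants of decoding provide model-based planning and goal-reaching capabilities.
Evaluate whether this sequence modeling approach can match or exceed specialized offline RL methods.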

Proposed Method

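Represent trajectories as discretized sequences of states, actions, and rewards, modeled autoregressively.
Train a Transformer decoder (the Trajectory Transformer) to model P_θ(s_t, a_t, r_t | history).
Discretize each continuous dimension with either uniform or quantile discretization to form a sequence of discrete tokens.
Use beam search as the planning algorithm, generating high-reward trajectories by maximizing (or approximately maximizing) either sequence likelihood or reward.
Augment rewards with reward-to-go to guide offline planning, and optionally use a Q-function as a search heuristic in sparse-reward tasks.
Apply the same decoding procedure to imitation learning, goal-conditioned RL, and offline RL, with only minimal changes to the conditioning inputs and sequence length.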
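The two discretization schemes named above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the helper names (`uniform_edges`, `quantile_edges`, `tokenize`, `detokenize`) and the toy data are hypothetical and exist only to show how the two schemes differ on a skewed dimension.

```python
from bisect import bisect_right

def uniform_edges(values, n_bins):
    """Interior edges that split [min, max] into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def quantile_edges(values, n_bins):
    """Interior edges chosen so each bin holds roughly the same number of points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def tokenize(value, edges):
    """Map a continuous value to a bin index in [0, len(edges)]."""
    return bisect_right(edges, value)

def detokenize(token, edges, lo, hi):
    """Map a token back to the midpoint of its bin (a lossy reconstruction)."""
    bounds = [lo] + list(edges) + [hi]
    return 0.5 * (bounds[token] + bounds[token + 1])

# Hypothetical 1-D state dimension with one large outlier.
data = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 10.0]

u_edges = uniform_edges(data, 4)
q_edges = quantile_edges(data, 4)
u_tokens = [tokenize(v, u_edges) for v in data]
q_tokens = [tokenize(v, q_edges) for v in data]
```

On this toy data, uniform bins place every value except the outlier in the same token, while quantile bins spread the values evenly across tokens. In a full pipeline, each dimension of each state, action, and reward would presumably get its own set of edges, and the resulting tokens would be concatenated into one sequence per trajectory.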

Experimental Results

Research Questions

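RQ1: Can a high-capacity sequence model (a Transformer) accurately predict long-horizon trajectories without the conventional RL factorization?
RQ2: Is beam-search planning on a trajectory-level model competitive with specialized offline RL methods?
RQ3: Can the same model support imitation learning, goal-conditioned RL, and offline RL through simple decoding strategies?
RQ4: Does introducing a reward-to-go or Q-function heuristic improve planning in sparse-reward tasks?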

Key Findings

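The Trajectory Transformer (TT) predicts long horizons markedly more accurately than a standard single-step dynamics model, and its predictions remain plausible over 100 steps.
On offline RL benchmarks, TT with quantile discretization matches or exceeds state-of-the-art methods on several locomotion tasks and outperforms a number of baselines.
Combining TT planning with a Q-function as a search heuristic outperforms IQL and return-conditioned methods on sparse-reward tasks (AntMaze).
Using TT with standard beam search for imitation learning and goal reaching achieves strong performance, demonstrating the versatility of decoding-driven planning.
Decoding variants, such as conditioning on a goal by prepending the goal state to the sequence, achieve goal reaching without rewards or reward shaping.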
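To make the contrast between likelihood-maximizing decoding (imitation, goal reaching) and reward-maximizing decoding (offline RL) concrete, here is a minimal sketch of beam search with a pluggable scoring function. Everything in it is an assumption for illustration: `TOY_MODEL` is a hypothetical two-token stand-in for a trained sequence model, and `reward_score` treats the token value itself as the reward. The reward-to-go estimates and Q-function heuristic described in this review are omitted.

```python
import math

# Hypothetical toy "trajectory model": next-token probabilities depend only on
# the previous token (None marks the start of the sequence).
TOY_MODEL = {
    None: {0: 0.6, 1: 0.4},
    0: {0: 0.7, 1: 0.3},
    1: {0: 0.2, 1: 0.8},
}

def next_token_logprobs(seq):
    """Return {token: log-probability} for the token that follows seq."""
    last = seq[-1] if seq else None
    return {tok: math.log(p) for tok, p in TOY_MODEL[last].items()}

def beam_search(score_fn, horizon, beam_width, prefix=()):
    """Expand every beam by one token per step and keep the top beam_width."""
    beams = [(list(prefix), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(horizon):
        candidates = [
            (seq + [tok], logp + lp)
            for seq, logp in beams
            for tok, lp in next_token_logprobs(seq).items()
        ]
        candidates.sort(key=lambda c: score_fn(*c), reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

def likelihood_score(seq, logp):
    """Standard beam search: rank candidates by sequence log-likelihood."""
    return logp

def reward_score(seq, logp):
    """Reward-driven variant: rank by cumulative reward.
    Assumption: each token's value doubles as its reward."""
    return sum(seq)

likely_plan = beam_search(likelihood_score, horizon=3, beam_width=2)
reward_plan = beam_search(reward_score, horizon=3, beam_width=2)
```

With these toy probabilities, the likelihood-ranked search returns the most probable sequence, `[0, 0, 0]`, while the reward-ranked search returns `[1, 1, 1]`. Only the ranking criterion changes between the two calls; the search procedure and the model are the same, which is the design point the review attributes to the Trajectory Transformer.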


This review was generated by AI and reviewed by human editors.