QUICK REVIEW

[Paper Review] Offline Reinforcement Learning as One Big Sequence Modeling Problem

Michael Janner, Qiyang Li | arXiv (Cornell University) | 2021-06-03

Reinforcement Learning in Robotics | Citations: 41

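One-Line Summary

The paper treats a trajectory as a single unified sequence and uses a Transformer with beam search (the Trajectory Transformer) to perform imitation learning, goal-conditioned RL, and offline RL, achieving competitive or state-of-the-art results without many of the components of traditional RL.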

ABSTRACT

Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize problems in time. However, we can also view RL as a generic sequence modeling problem, with the goal being to produce a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide effective solutions to the RL problem. To this end, we explore how RL can be tackled with the tools of sequence modeling, using a Transformer architecture to model distributions over trajectories and repurposing beam search as a planning algorithm. Framing RL as sequence modeling problem simplifies a range of design decisions, allowing us to dispense with many of the components common in offline RL algorithms. We demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL. Further, we show that this approach can be combined with existing model-free algorithms to yield a state-of-the-art planner in sparse-reward, long-horizon tasks.

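Motivation and Objectives

Recast RL as a single unified sequence modeling problem in order to simplify design decisions and make use of high-capacity sequence models.
Demonstrate the accuracy of long-horizon trajectory prediction with a Transformer architecture.
Show that beam-search planning with the Trajectory Transformer yields competitive results in offline RL and also enables imitation learning and goal-conditioned RL.
Explore how variations of the decoding procedure provide model-based planning and goal-reaching capabilities.
Evaluate whether this sequence modeling approach matches or outperforms specialized offline RL methods.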

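Proposed Method

Represent a trajectory as a discretized sequence of states, actions, and rewards that is modeled autoregressively.
Train a Transformer decoder (the Trajectory Transformer) to model P_θ(s_t, a_t, r_t | history).
Discretize each continuous dimension with either uniform binning or quantile-based binning to form a stream of discrete tokens.
Use beam search as the planning algorithm: generate high-reward trajectories by maximizing (or approximately maximizing) either sequence likelihood or reward.
Augment trajectories with reward-to-go estimates to guide offline planning, and optionally use a Q-function as a search heuristic in sparse-reward tasks.
Apply the same decoding procedure to imitation learning, goal-conditioned RL, and offline RL, with only minimal changes to the conditioning input and sequence length.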
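The discretization and flattening steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`uniform_edges`, `quantile_edges`, `tokenize`, `trajectory_to_tokens`), the bin count, and the random data are all assumptions made for illustration. It shows how each continuous dimension might be mapped to bin indices and how T transitions become one token stream of length T × (N + M + 1).

```python
import numpy as np

def uniform_edges(x, n_bins):
    """Equal-width bin edges spanning the observed range of one dimension."""
    return np.linspace(x.min(), x.max(), n_bins + 1)

def quantile_edges(x, n_bins):
    """Bin edges placed so that each bin holds roughly the same number of points."""
    return np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))

def tokenize(x, edges):
    """Map each value to a bin index in [0, n_bins - 1]."""
    # Only interior edges are searched, so extreme values fall in the first/last bin.
    idx = np.searchsorted(edges[1:-1], x, side="right")
    return np.clip(idx, 0, len(edges) - 2)

def trajectory_to_tokens(states, actions, rewards, edges_per_dim):
    """Flatten T transitions into one stream: state dims, action dims, reward, repeated."""
    joined = np.concatenate([states, actions, rewards[:, None]], axis=1)
    tokens = np.stack(
        [tokenize(joined[:, d], edges_per_dim[d]) for d in range(joined.shape[1])],
        axis=1,
    )
    return tokens.reshape(-1)

# Hypothetical data: 50 timesteps, 3 state dims, 2 action dims, scalar reward.
rng = np.random.default_rng(0)
states = rng.normal(size=(50, 3))
actions = rng.uniform(-1.0, 1.0, size=(50, 2))
rewards = rng.exponential(size=50)

joined = np.concatenate([states, actions, rewards[:, None]], axis=1)
edges = [quantile_edges(joined[:, d], 10) for d in range(joined.shape[1])]
stream = trajectory_to_tokens(states, actions, rewards, edges)
reward_tokens_uniform = tokenize(rewards, uniform_edges(rewards, 10))
```

With skewed data such as the exponential rewards here, uniform bins tend to leave most tokens in the lowest few bins, while quantile bins spread the data evenly across the vocabulary.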

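Experimental Results

Research Questions

RQ1: Can a high-capacity sequence model (a Transformer) accurately predict long-horizon trajectories without the classical RL factorization?
RQ2: Is beam-search planning over a trajectory model competitive with specialized offline RL methods?
RQ3: Can the same model support imitation learning, goal-conditioned RL, and offline RL through simple decoding strategies?
RQ4: Does introducing reward-to-go or a Q-function heuristic improve planning in sparse-reward tasks?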

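Key Results

The Trajectory Transformer gives substantially better long-horizon prediction accuracy than standard single-step dynamics models, and its predictions remain plausible even at a horizon of 100 steps.
On offline RL benchmarks, TT (with quantile discretization) matches or outperforms state-of-the-art methods across robot control tasks and exceeds several baselines.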
Combining TT planning with a Q-function as the search heuristic outperforms IQL and return-conditioning approaches on sparse-reward tasks such as AntMaze.
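Imitation learning and goal reaching with TT achieve high performance even with standard beam search, demonstrating the versatility of decoding-based planning.
Variations of the decoding procedure (for example, goal conditioning by prepending the goal state) make goal reaching possible without any reward.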
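The decoding-based planning referred to in these results can be sketched as a beam search that ranks candidate action sequences by the reward collected so far plus a reward-to-go estimate, instead of by likelihood. This is a simplified illustration under stated assumptions, not the paper's code: in the Trajectory Transformer the candidates are sampled token by token from the learned model, whereas here a hand-written `toy_step` function stands in for the model and all actions are enumerated. The names `beam_plan`, `toy_step`, `zero_heuristic`, and `GOAL` are hypothetical.

```python
from typing import Callable, List, Tuple

def beam_plan(
    start_state: int,
    actions: List[int],
    step_model: Callable[[int, int], Tuple[int, float]],
    reward_to_go: Callable[[int], float],
    horizon: int,
    beam_width: int,
) -> List[int]:
    """Return the action sequence with the highest predicted score."""
    # Each beam entry holds (cumulative reward, current state, actions taken so far).
    beams = [(0.0, start_state, [])]
    for _ in range(horizon):
        candidates = []
        for total, state, plan in beams:
            for a in actions:
                next_state, reward = step_model(state, a)
                candidates.append((total + reward, next_state, plan + [a]))
        # Score = reward collected so far + estimated reward still to come.
        candidates.sort(key=lambda c: c[0] + reward_to_go(c[1]), reverse=True)
        beams = candidates[:beam_width]
    best = max(beams, key=lambda c: c[0] + reward_to_go(c[1]))
    return best[2]

GOAL = 3  # hypothetical target position on a 1-D chain

def toy_step(state: int, action: int) -> Tuple[int, float]:
    """Stand-in for a learned model: deterministic move, reward is negative distance to GOAL."""
    next_state = state + action
    return next_state, -float(abs(next_state - GOAL))

def zero_heuristic(state: int) -> float:
    """Placeholder reward-to-go; a learned value or Q-function estimate could be used instead."""
    return 0.0

plan = beam_plan(0, [-1, 0, 1], toy_step, zero_heuristic, horizon=4, beam_width=3)
print(plan)
```

Swapping `zero_heuristic` for a learned value estimate corresponds to the Q-function search heuristic mentioned for sparse-reward tasks, where per-step rewards alone give the beam little to rank on.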

This review was generated by AI and reviewed by a human editor.