QUICK REVIEW

[Paper Review] Text2Reward: Reward Shaping with Language Models for Reinforcement Learning

Tianbao Xie, Siheng Zhao|arXiv (Cornell University)|2023. 09. 20.

Software Engineering Research|Citations: 8

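One-line summary

Text2Reward uses LLMs to automate the generation of dense reward functions for RL. It needs no training data, produces interpretable reward code that guides learning across manipulation and locomotion tasks, and refines that code with human-in-the-loop feedback.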

ABSTRACT

Designing reward functions is a longstanding challenge in reinforcement learning (RL); it requires specialized knowledge or domain data, leading to high costs for development. To address this, we introduce Text2Reward, a data-free framework that automates the generation and shaping of dense reward functions based on large language models (LLMs). Given a goal described in natural language, Text2Reward generates shaped dense reward functions as an executable program grounded in a compact representation of the environment. Unlike inverse RL and recent work that uses LLMs to write sparse reward codes or unshaped dense rewards with a constant function across timesteps, Text2Reward produces interpretable, free-form dense reward codes that cover a wide range of tasks, utilize existing packages, and allow iterative refinement with human feedback. We evaluate Text2Reward on two robotic manipulation benchmarks (ManiSkill2, MetaWorld) and two locomotion environments of MuJoCo. On 13 of the 17 manipulation tasks, policies trained with generated reward codes achieve similar or better task success rates and convergence speed than expert-written reward codes. For locomotion tasks, our method learns six novel locomotion behaviors with a success rate exceeding 94%. Furthermore, we show that the policies trained in the simulator with our method can be deployed in the real world. Finally, Text2Reward further improves the policies by refining their reward functions with human feedback. Video results are available at https://text-to-reward.github.io/ .

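Research motivation and goals

Reduce the manual effort and cost of reward design in RL by starting from natural-language goals.
Generate dense, executable reward code grounded in a compact representation of the environment.
Enable zero-shot and few-shot reward generation, with refinement through interactive human feedback.
Demonstrate transfer beyond simulation to a real robot, and coverage of a wide range of RL tasks.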

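Proposed method

Ground the environment in a compact Pythonic abstraction of its states, objects, and actions.
Use a large language model to translate a natural-language goal into dense reward code that is executable in Python.
Guide code generation with background knowledge and few-shot examples.
Execute the generated reward code to catch syntax and runtime errors, and feed those errors back to the LLM for iterative refinement.
After an RL run, refine the reward function further through interactive human feedback.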
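
The first four steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the classes `RigidObject`, `Robot`, and `EnvState`, the function name `compute_dense_reward`, the helper `check_reward_code`, and the reward shaping constants are all hypothetical, and the LLM call is replaced by a hard-coded string standing in for model output. It uses only the Python standard library.

```python
import math
import traceback
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Hypothetical Pythonic abstraction of the environment (illustrative names only).
@dataclass
class RigidObject:
    position: Vec3          # object position in world coordinates

@dataclass
class Robot:
    ee_position: Vec3       # end-effector position

@dataclass
class EnvState:
    robot: Robot
    cube: RigidObject
    goal_position: Vec3

# Stand-in for LLM output given the goal "move the cube to the goal position".
# In a real pipeline this string would come from a model, not be hard-coded.
GENERATED_REWARD_CODE = '''
import math

def compute_dense_reward(state, action):
    # Stage 1: bring the end-effector close to the cube.
    reach_dist = math.dist(state.robot.ee_position, state.cube.position)
    reward = 1.0 - math.tanh(5.0 * reach_dist)
    # Stage 2: bring the cube close to the goal.
    goal_dist = math.dist(state.cube.position, state.goal_position)
    reward += 1.0 - math.tanh(5.0 * goal_dist)
    # Small penalty on action magnitude to discourage jerky motion.
    reward -= 0.01 * sum(a * a for a in action)
    return reward
'''

def check_reward_code(
    code: str, state: EnvState, action: Sequence[float]
) -> Tuple[Optional[Callable], Optional[str]]:
    """Run generated code once on a sample state.

    Returns (reward_fn, None) if the code compiles, runs, and yields a finite
    number; otherwise (None, traceback_text), which could be sent back to the
    LLM as feedback for the next attempt.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)  # raises SyntaxError on malformed code
        reward_fn = namespace["compute_dense_reward"]
        value = float(reward_fn(state, action))  # surfaces runtime errors
        if not math.isfinite(value):
            raise ValueError("reward is not finite")
    except Exception:
        return None, traceback.format_exc()
    return reward_fn, None

# Sample state used only to smoke-test the generated function.
sample_state = EnvState(
    robot=Robot(ee_position=(0.0, 0.0, 0.2)),
    cube=RigidObject(position=(0.1, 0.0, 0.0)),
    goal_position=(0.3, 0.0, 0.0),
)
reward_fn, error = check_reward_code(GENERATED_REWARD_CODE, sample_state, [0.0, 0.0, 0.0])
print("accepted" if error is None else error)
```

Running untrusted generated code with `exec` is acceptable for a sketch like this but would need sandboxing in practice. The human feedback step is not shown, since it depends on how users inspect rollouts of the trained policy.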

Figure 1: An overview of Text2Reward of three stages: Expert Abstraction provides an abstraction of the environment as a hierarchy of Pythonic classes. User Instruction describes the goal to be achieved in natural language. User Feedback allows users to summarize the failure mode or their preference

Experimental results

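Research questions

RQ1: Can zero-shot or few-shot dense reward code generated by an LLM match expert-designed rewards on manipulation tasks in convergence speed and success rate?
RQ2: When the goal is ambiguous or underspecified, does interactive human feedback improve reward-function quality and RL success?
RQ3: Can policies trained with Text2Reward transfer to real robot hardware without extensive retraining?
RQ4: Can generated reward code generalize to novel locomotion tasks, extending beyond the training distribution?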

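Key results

On 13 of the 17 manipulation tasks, Text2Reward matches or exceeds expert-tuned rewards in success rate and convergence speed.
Zero-shot or few-shot Text2Reward outperforms the expert reward in convergence speed or success rate on 4 tasks.
On MuJoCo locomotion, Text2Reward learns 6 novel behaviors with success rates above 94% (human-evaluated).
Policies trained in simulation with Text2Reward can be deployed on a real Franka Panda robot with minimal calibration.
Interactive feedback resolves task ambiguity and can raise the success rate with each iteration.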

Figure 2: Learning curves on ManiSkill2 under zero-shot and few-shot reward generation settings, measured by task success rate. Oracle means the expert-written reward function provided by the environment; zero-shot and few-shot mean the reward function is generated by Text2Reward w.o and w. re

This review was generated by AI and checked by a human editor.