QUICK REVIEW

[Paper Review] Text2Reward: Reward Shaping with Language Models for Reinforcement Learning

Tianbao Xie, Siheng Zhao|arXiv (Cornell University)|Sep 20, 2023

Software Engineering Research | Cited by 8

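ONE-SENTENCE SUMMARY

Text2Reward uses large language models to automatically generate dense reward functions for reinforcement learning. It is data-free, produces interpretable reward code, covers manipulation and locomotion tasks, and supports iterative refinement with human feedback.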

ABSTRACT

Designing reward functions is a longstanding challenge in reinforcement learning (RL); it requires specialized knowledge or domain data, leading to high costs for development. To address this, we introduce Text2Reward, a data-free framework that automates the generation and shaping of dense reward functions based on large language models (LLMs). Given a goal described in natural language, Text2Reward generates shaped dense reward functions as an executable program grounded in a compact representation of the environment. Unlike inverse RL and recent work that uses LLMs to write sparse reward codes or unshaped dense rewards with a constant function across timesteps, Text2Reward produces interpretable, free-form dense reward codes that cover a wide range of tasks, utilize existing packages, and allow iterative refinement with human feedback. We evaluate Text2Reward on two robotic manipulation benchmarks (ManiSkill2, MetaWorld) and two locomotion environments of MuJoCo. On 13 of the 17 manipulation tasks, policies trained with generated reward codes achieve similar or better task success rates and convergence speed than expert-written reward codes. For locomotion tasks, our method learns six novel locomotion behaviors with a success rate exceeding 94%. Furthermore, we show that the policies trained in the simulator with our method can be deployed in the real world. Finally, Text2Reward further improves the policies by refining their reward functions with human feedback. Video results are available at https://text-to-reward.github.io/ .

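MOTIVATION AND GOALS

Reduce the human cost and effort of reward design by specifying goals in natural language.
Generate dense, executable reward code grounded in a compact representation of the environment.
Support zero-shot and few-shot reward generation, with refinement through interactive human feedback.
Demonstrate applicability across a broad range of RL tasks in simulation, and transfer of the trained policies to a real robot.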

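PROPOSED METHOD

Ground the environment in a compact Pythonic abstraction of its states, objects, and actions.
Use a large language model to translate the natural language goal into dense reward code that is executable in Python.
Guide code generation with background knowledge and few-shot examples.
Execute the generated reward code to catch syntax and runtime errors, and feed those errors back to the LLM for iterative refinement.
After RL training, collect interactive human feedback to further refine the reward function.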
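
The execute-and-refine step above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `stub_llm`, `try_reward_code`, `refine_until_runnable`, the entry-point name `compute_dense_reward`, and the observation keys are all hypothetical. A real system would call an actual LLM and run generated code in a sandbox rather than with a bare `exec`.

```python
import traceback
from typing import Callable, Optional


def try_reward_code(code: str, sample_obs: dict) -> Optional[str]:
    """Run candidate reward code once; return an error message, or None if it ran cleanly."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # surfaces syntax errors
        reward = namespace["compute_dense_reward"](sample_obs)  # surfaces runtime errors
        float(reward)  # the reward must be a scalar number
    except Exception:
        return traceback.format_exc(limit=1)
    return None


def refine_until_runnable(
    generate: Callable[[str], str],
    prompt: str,
    sample_obs: dict,
    max_rounds: int = 3,
) -> str:
    """Ask the generator for code, appending the error message to the prompt after each failure."""
    for _ in range(max_rounds):
        code = generate(prompt)
        error = try_reward_code(code, sample_obs)
        if error is None:
            return code
        prompt = f"{prompt}\n\nThe previous code failed with:\n{error}\nPlease fix it."
    raise RuntimeError("no runnable reward code within the round budget")


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: the first reply has a bug, the retry fixes it."""
    if "failed with" in prompt:
        return "def compute_dense_reward(obs):\n    return -abs(obs['cube_x'] - obs['goal_x'])\n"
    # Buggy first attempt: reads a key that does not exist in the observation.
    return "def compute_dense_reward(obs):\n    return -abs(obs['cube_x'] - obs['goal'])\n"


sample = {"cube_x": 0.0, "goal_x": 1.0}
final_code = refine_until_runnable(stub_llm, "Push the cube to the goal.", sample)
```

Here the first candidate raises a `KeyError`, the error text is appended to the prompt, and the second candidate runs cleanly. Only execution errors are caught this way; whether the reward actually induces the intended behavior still has to be judged after RL training, which is where the human feedback stage comes in.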

Figure 1: An overview of Text2Reward of three stages: Expert Abstraction provides an abstraction of the environment as a hierarchy of Pythonic classes. User Instruction describes the goal to be achieved in natural language. User Feedback allows users to summarize the failure mode or their preference
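
To make the "hierarchy of Pythonic classes" idea concrete, here is a minimal sketch of what such an abstraction, and a shaped dense reward written against it, might look like. Every class name, field, weight, and the 2 cm success threshold is an illustrative assumption; none of them is taken from the paper, ManiSkill2, or MetaWorld.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class RigidObject:
    position: Vec3  # world-frame position in meters (assumed field)


@dataclass
class Robot:
    ee_position: Vec3  # end-effector position in meters (assumed field)


@dataclass
class Env:
    robot: Robot
    cube: RigidObject
    goal_position: Vec3


def compute_dense_reward(env: Env) -> float:
    """Shaped reward for a hypothetical 'push the cube to the goal' instruction."""
    # Stage 1: bring the end effector close to the cube.
    reach_dist = math.dist(env.robot.ee_position, env.cube.position)
    # Stage 2: bring the cube close to the goal (weighted higher than reaching).
    goal_dist = math.dist(env.cube.position, env.goal_position)
    reward = -1.0 * reach_dist - 2.0 * goal_dist
    # Bonus once the cube is within 2 cm of the goal.
    if goal_dist < 0.02:
        reward += 5.0
    return float(reward)


far = Env(Robot((0.0, 0.0, 0.3)), RigidObject((0.2, 0.0, 0.0)), (0.5, 0.0, 0.0))
done = Env(Robot((0.5, 0.0, 0.0)), RigidObject((0.5, 0.0, 0.0)), (0.5, 0.0, 0.0))
```

Because the reward changes continuously with both distances, the policy gets a learning signal at every timestep, in contrast to a sparse reward that only fires on success.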

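EXPERIMENTAL RESULTS

Research Questions

RQ1: Does zero-shot or few-shot dense reward code generated by an LLM match expert-designed rewards on manipulation tasks?
RQ2: When the goal is ambiguous or underspecified, does interactive human feedback improve reward function quality and RL success rate?
RQ3: Do policies trained with Text2Reward transfer to real robot hardware without extensive retraining?
RQ4: Does the reward code generalize to novel locomotion tasks beyond the training distribution?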

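Key Findings

On 13 of the 17 manipulation tasks, Text2Reward matches or exceeds expert-tuned rewards in success rate and convergence speed.
On 4 tasks, zero-shot or few-shot Text2Reward outperforms the expert reward in convergence speed or success rate.
On MuJoCo locomotion tasks, Text2Reward enables six novel behaviors, with human-evaluated success rates above 94%.
Policies trained in simulation with Text2Reward can be deployed on a real Franka Panda robot with little calibration.
Interactive feedback further improves performance, resolving task ambiguity and raising success rates over iterations.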

Figure 2: Learning curves on Maniskill2 under zero-shot and few-shot reward generation settings, measured by task success rate. Oracle means the expert-written reward function provided by the environment; zero-shot and few-shot stands for the reward function is generated by Text2Reward w.o and w. re

This review was generated by AI and checked by a human editor.