QUICK REVIEW

[Paper Review] Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks

Haoqi Yuan, Chi Zhang|arXiv (Cornell University)|Mar 29, 2023

Multimodal Machine Learning Applications | Cited by 11

ONE-SENTENCE SUMMARY

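Plan4MC learns fine-grained skills with intrinsic rewards, builds a skill graph with an LLM, and uses a DFS-based skill planner to solve 40 diverse Minecraft tasks without demonstrations, with strong sample efficiency.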

ABSTRACT

We study building multi-task agents in open-world environments. Without human demonstrations, learning to accomplish long-horizon tasks in a large open-world environment with reinforcement learning (RL) is extremely inefficient. To tackle this challenge, we convert the multi-task learning problem into learning basic skills and planning over the skills. Using the popular open-world game Minecraft as the testbed, we propose three types of fine-grained basic skills, and use RL with intrinsic rewards to acquire skills. A novel Finding-skill that performs exploration to find diverse items provides better initialization for other skills, improving the sample efficiency for skill learning. In skill planning, we leverage the prior knowledge in Large Language Models to find the relationships between skills and build a skill graph. When the agent is solving a task, our skill search algorithm walks on the skill graph and generates the proper skill plans for the agent. In experiments, our method accomplishes 40 diverse Minecraft tasks, where many tasks require sequentially executing for more than 10 skills. Our method outperforms baselines by a large margin and is the most sample-efficient demonstration-free RL method to solve Minecraft Tech Tree tasks. The project's website and code can be found at https://sites.google.com/view/plan4mc.

MOTIVATION AND GOALS

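Build multi-task agents that operate in open-world environments without human demonstrations.
Decompose long-horizon tasks into sequences of fine-grained basic skills (Finding, Manipulation, Crafting).
Train the skills with intrinsic rewards, and introduce a novel Finding-skill to improve exploration and initialization.
Use an LLM-generated skill graph and a DFS-based planner to produce executable skill sequences for each task.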

PROPOSED METHOD

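Define three types of fine-grained Minecraft skills: Finding-skills, Manipulation-skills, and Crafting-skills.
Train the basic skills with RL using intrinsic rewards; introduce a hierarchical Finding-skill for exploration, which provides better initialization for the other skills.
Build a skill graph by prompting an LLM to describe the relationships and preconditions between skills; run DFS-based planning on the graph to produce executable skill sequences.
Develop a skill search algorithm that iteratively executes the planned skills and updates the inventory to plan the subsequent steps.
Compare Plan4MC with baselines (MineAgent, Plan4MC w/o Find-skill, Interactive LLM planning, and the Plan4MC Zero-shot and Plan4MC 1/2-steps variants) on 40 MineDojo tasks.
Show that Plan4MC achieves higher sample efficiency and higher success rates on long-horizon tasks.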
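The planning loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the skill names, the `Skill` fields (`consume`, `require`, `yields`), the hand-written wooden-pickaxe graph (standing in for the LLM-generated skill graph), and the `execute` callback are all hypothetical. It illustrates two ideas from the summary: a depth-first expansion of the skill graph from the target item given the current inventory, and a search loop that runs only the first planned skill, updates the inventory, and replans.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """One node of a hypothetical skill graph; field names are illustrative."""
    name: str
    obtains: str                                  # item the skill produces
    yields: int = 1                               # units produced per execution
    consume: dict = field(default_factory=dict)   # ingredients used up
    require: dict = field(default_factory=dict)   # tools needed but kept


# Hand-written stand-in for an LLM-built graph (vanilla Minecraft recipes).
SKILLS = [
    Skill("harvest_log", "log"),
    Skill("craft_planks", "planks", yields=4, consume={"log": 1}),
    Skill("craft_stick", "stick", yields=4, consume={"planks": 2}),
    Skill("craft_crafting_table", "crafting_table", consume={"planks": 4}),
    Skill("craft_wooden_pickaxe", "wooden_pickaxe",
          consume={"planks": 3, "stick": 2}, require={"crafting_table": 1}),
]
PRODUCER = {s.obtains: s for s in SKILLS}   # item -> skill that obtains it
BY_NAME = {s.name: s for s in SKILLS}


def apply_skill(skill, inv):
    """Simulate a successful execution of `skill` on inventory `inv`."""
    for item, n in skill.consume.items():
        inv[item] -= n
    inv[skill.obtains] = inv.get(skill.obtains, 0) + skill.yields


def _ensure(item, qty, inv, plan, depth=0, max_depth=32):
    """Depth-first expansion: append skills to `plan` until inv[item] >= qty."""
    if depth > max_depth:
        raise RecursionError(f"graph too deep or cyclic at {item!r}")
    while inv.get(item, 0) < qty:
        skill = PRODUCER[item]                      # KeyError: no skill for item
        needs = {**skill.require, **skill.consume}  # assumes disjoint keys
        # Re-check after each sub-goal: crafting one input may use up another.
        while missing := [(k, n) for k, n in needs.items() if inv.get(k, 0) < n]:
            _ensure(*missing[0], inv, plan, depth + 1, max_depth)
        apply_skill(skill, inv)
        plan.append(skill.name)


def plan_skills(target, qty, inventory):
    """Return a skill sequence for `target` without mutating `inventory`."""
    plan = []
    _ensure(target, qty, dict(inventory), plan)
    return plan


def run_task(target, qty, inventory, execute, max_attempts=50):
    """Replan from the current inventory, run only the first skill, repeat."""
    inv, attempts = dict(inventory), []
    for _ in range(max_attempts):
        plan = plan_skills(target, qty, inv)
        if not plan:
            return True, attempts, inv
        skill = BY_NAME[plan[0]]
        attempts.append(skill.name)
        if execute(skill.name):        # hypothetical environment call
            apply_skill(skill, inv)    # a real agent would read inv from observations
    return False, attempts, inv


if __name__ == "__main__":
    print(plan_skills("wooden_pickaxe", 1, {}))
    # ['harvest_log', 'craft_planks', 'craft_crafting_table', 'harvest_log',
    #  'craft_planks', 'craft_stick', 'harvest_log', 'craft_planks',
    #  'craft_wooden_pickaxe']
```

Replanning after every skill lets the agent recover from a failed skill: the inventory is unchanged, so the same skill is planned again on the next iteration. Here the inventory update is simulated; in an actual environment it would come from the observation.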

EXPERIMENTAL RESULTS

Research Questions

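RQ1: Can RL learn fine-grained basic skills without demonstrations in a large open-world environment?
RQ2: Does introducing the Finding-skill improve sample efficiency and overall task success on long-horizon Minecraft tasks?
RQ3: Can an LLM-generated skill graph combined with a DFS-based planner effectively decompose and solve open-world tasks?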

Key Findings

Task Set (average success rate)	MineAgent	Plan4MC w/o Find-skill	Interactive LLM	Plan4MC Zero-shot	Plan4MC 1/2-steps	Plan4MC
Cut-Trees	0.003	0.187	0.260	0.183	0.337	0.417
Mine-Stones	0.026	0.097	0.067	0.000	0.163	0.293
Mine-Ores	0.000	0.243	0.030	0.000	0.143	0.267
Interact-Mobs	0.171	0.170	0.247	0.133	0.277	0.320
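The comparisons drawn from this table can be checked with a short Python script. The numbers are copied from the table; the function names are hypothetical, and the per-method mean is an unweighted average of the four task-set scores, which ignores how many tasks each set contains and is not necessarily how the paper aggregates results.

```python
# Success rates copied from the table (rows: task sets; columns: methods).
METHODS = ["MineAgent", "Plan4MC w/o Find-skill", "Interactive LLM",
           "Plan4MC Zero-shot", "Plan4MC 1/2-steps", "Plan4MC"]
TABLE = {
    "Cut-Trees":     [0.003, 0.187, 0.260, 0.183, 0.337, 0.417],
    "Mine-Stones":   [0.026, 0.097, 0.067, 0.000, 0.163, 0.293],
    "Mine-Ores":     [0.000, 0.243, 0.030, 0.000, 0.143, 0.267],
    "Interact-Mobs": [0.171, 0.170, 0.247, 0.133, 0.277, 0.320],
}


def best_method(row):
    """Name of the method with the highest score in one task-set row."""
    return METHODS[max(range(len(row)), key=row.__getitem__)]


def mean_per_method(table):
    """Unweighted mean over task sets (ignores tasks per set)."""
    n = len(table)
    return {m: sum(r[i] for r in table.values()) / n for i, m in enumerate(METHODS)}


def find_skill_gain(table):
    """Plan4MC minus Plan4MC w/o Find-skill, per task set."""
    i_with = METHODS.index("Plan4MC")
    i_without = METHODS.index("Plan4MC w/o Find-skill")
    return {t: round(r[i_with] - r[i_without], 3) for t, r in table.items()}


if __name__ == "__main__":
    for task_set, row in TABLE.items():
        print(f"{task_set}: best = {best_method(row)}")   # Plan4MC in every row
    for method, score in mean_per_method(TABLE).items():
        print(f"{method}: {score:.3f}")
    print(find_skill_gain(TABLE))
```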

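Plan4MC accomplishes 40 diverse Minecraft tasks, which typically require 2 to 30 basic skills each.
Plan4MC outperforms the baselines on all four task sets and is more sample-efficient than other demonstration-free RL methods.
Adding the Finding-skill raises success rates over Plan4MC w/o Find-skill on all four task sets, most markedly on Cut-Trees (0.187 → 0.417) and Mine-Stones (0.097 → 0.293).
The LLM-built skill graph combined with the skill search algorithm yields robust, executable plans with fewer planning errors than planning with an LLM alone.
Interactive LLM planning can reach Plan4MC's level on simple tasks but performs poorly on long-horizon tasks because of planning errors.
Plan4MC performs well at crafting an iron pickaxe, a Minecraft Tech Tree task.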


This review was generated by AI and checked by human editors.