QUICK REVIEW

[Paper Review] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael J. Ahn, Anthony Brohan|arXiv (Cornell University)|Apr 4, 2022

Multimodal Machine Learning Applications|Cited by 512

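One-sentence summary

The paper proposes SayCan, a framework that grounds large language models (LLMs) in robotics by combining their high-level planning with the learned affordances of pretrained skills, enabling a mobile manipulator to execute real-world, long-horizon instructions.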

ABSTRACT

Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and the video can be found at https://say-can.github.io/.

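Motivation and objectives

Show that LLMs lack real-world grounding and can fail when deployed on embodied agents.
Propose grounding LLM outputs in real-world affordances derived from pretrained skills.
Produce interpretable, step-by-step plans that are consistent with what is feasible in the robot's environment.
Demonstrate real-world performance on long-horizon kitchen tasks with a mobile robot.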

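Proposed method

Represent each low-level skill with a policy and a TD-trained value function (its affordance).
For each skill description ell_pi, use the LLM to compute p(ell_pi|i), the probability of that description given the instruction i.
Compute p(c_pi|s,ell_pi) as the skill's affordance: the probability that the skill succeeds from state s.
Combine the two scores as p(c_pi|s,ell_pi) * p(ell_pi|i) and select the skill pi with the highest product as the next step.
Execute the selected skill, update the LLM context with it, and re-query the LLM; repeat until the plan is complete.
Train multi-task policies conditioned on text embeddings, using behavioral cloning (BC) or reinforcement learning (RL).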
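
The selection and re-query steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`select_skill`, `saycan_plan`), the toy skill list, the hand-written probability tables, and the use of a "done" skill with a fixed affordance of 1.0 to end the plan are all assumptions made for the example. In the real system, p(ell_pi|i) would come from an LLM and p(c_pi|s,ell_pi) from learned value functions.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

State = FrozenSet[str]

# Hypothetical skill descriptions, for illustration only.
SKILLS = ["find a sponge", "pick up the sponge", "go to the spill", "done"]


def toy_llm_prob(instruction: str, history: List[str], skill: str) -> float:
    """Stand-in for p(ell_pi | i): a fixed table, not a real language model."""
    if skill == "done":
        return 0.9 if len(history) >= 3 else 0.05
    if skill in history:
        return 0.01  # already executed, so unlikely to be the next step
    return {"find a sponge": 0.3, "pick up the sponge": 0.5, "go to the spill": 0.2}[skill]


def toy_affordance(state: State, skill: str) -> float:
    """Stand-in for p(c_pi | s, ell_pi): success probability from state s."""
    if skill == "done":
        return 1.0  # assumption: terminating is always feasible
    if skill == "pick up the sponge":
        return 0.9 if "sponge visible" in state else 0.05
    return 0.8


def toy_execute(state: State, skill: str) -> State:
    """Stand-in for running a skill on the robot; returns the new state."""
    effects = {"find a sponge": "sponge visible", "pick up the sponge": "holding sponge"}
    return state | {effects[skill]} if skill in effects else state


def select_skill(instruction: str, history: List[str], state: State, skills: List[str],
                 llm_prob: Callable, affordance: Callable) -> Tuple[str, Dict[str, float]]:
    """Score every skill by affordance * LLM probability and return the best one."""
    scores = {
        skill: affordance(state, skill) * llm_prob(instruction, history, skill)
        for skill in skills
    }
    return max(scores, key=scores.get), scores


def saycan_plan(instruction: str, state: State, skills: List[str], llm_prob: Callable,
                affordance: Callable, execute: Callable, max_steps: int = 10) -> List[str]:
    """Select a skill, execute it, extend the context, and repeat until "done"."""
    history: List[str] = []
    for _ in range(max_steps):
        skill, _ = select_skill(instruction, history, state, skills, llm_prob, affordance)
        if skill == "done":
            break
        state = execute(state, skill)
        history.append(skill)  # the executed step becomes part of the next query
    return history


if __name__ == "__main__":
    print(saycan_plan("clean up the spill", frozenset(), SKILLS,
                      toy_llm_prob, toy_affordance, toy_execute))
```

In this toy setup the language-model table alone would rank "pick up the sponge" first, but its affordance is low while no sponge is visible, so the product favors "find a sponge". That is the grounding effect the method relies on.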

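Experimental results

Research questions

RQ1 Can an embodied agent execute high-level natural language instructions by grounding LLM knowledge in real-world affordances?
RQ2 Does combining LLM-guided planning with skill affordances improve planning and execution on a real robot?
RQ3 How does the method scale to long-horizon, abstract tasks in a kitchen environment?
RQ4 How do different language models and grounding components affect performance?
RQ5 What is the effect of adding new skills to the system?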

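Key findings

PaLM-SayCan achieves an 84% planning success rate and a 74% execution success rate in the mock kitchen.
In the real kitchen, planning and execution drop to 81% and 60%, respectively, indicating reasonable generalization to the real world.
Combining affordance grounding with LLM guidance nearly doubles performance over non-grounded baselines.
Larger LLMs improve performance: PaLM (540B) outperforms FLAN on both planning and execution of the full system.
Ablations show that both language grounding and affordance grounding are necessary for strong performance.
New skills (such as drawer manipulation) can be integrated easily while performance on existing tasks is maintained.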

This review was generated by AI and checked by human editors.