QUICK REVIEW

[Paper Review] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael J. Ahn, Anthony Brohan|arXiv (Cornell University)|Apr 4, 2022

Multimodal Machine Learning Applications|Cited by 512

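One-line summary

This paper proposes SayCan, a framework that grounds LLMs in robotics by combining high-level planning with learned affordances from pretrained skills, enabling a mobile manipulator to execute real-world, long-horizon instructions.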

ABSTRACT

Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and the video can be found at https://say-can.github.io/.

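Motivation and objectives

Motivates the problem that LLMs lack real-world grounding and can therefore fail when deployed on embodied agents.
Proposes grounding LLM outputs with world-aware affordances derived from pretrained skills.
Enables interpretable, step-by-step plans that are executable in the robot's environment.
Demonstrates real-world performance on long-horizon kitchen tasks with a mobile robot.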

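Proposed method

Represent each low-level skill by a policy and a value function (affordance) trained with TD learning.
For an instruction i, use the LLM to compute p(ℓ_π | i) for each skill description ℓ_π.
Compute p(c_π | s, ℓ_π), the probability that the skill succeeds from state s, as the skill's affordance.
Combine the scores as p(c_π | s, ℓ_π) · p(ℓ_π | i) and select the skill π with the highest combined score as the next skill.
Execute the selected skill, then re-query the LLM with the updated context, repeating until the plan is complete.
Train language-conditioned policies with behavioral cloning (BC) or reinforcement learning (RL) in a multi-task setting conditioned on text embeddings.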
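
The selection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the LLM scorer, the affordance function, the executor, the state labels, and all probabilities are hypothetical stand-ins, and treating "done" as an always-feasible skill is a simplification made here. The toy LLM deliberately prefers "pick up the sponge" first, so the example shows how the affordance term overrides an infeasible suggestion.

```python
from typing import Callable, Dict, List, Tuple

DONE = "done"  # hypothetical name for the termination "skill"
SKILLS = ["find a sponge", "pick up the sponge", "bring it to the user", DONE]


def select_skill(llm_probs: Dict[str, float],
                 affordances: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Return the skill maximizing p(c_pi | s, l_pi) * p(l_pi | i)."""
    combined = {skill: llm_probs[skill] * affordances[skill] for skill in llm_probs}
    best = max(combined, key=combined.get)
    return best, combined


def saycan_plan(instruction: str,
                skills: List[str],
                llm_score: Callable[[str, List[str], List[str]], Dict[str, float]],
                affordance: Callable[[str, str], float],
                execute: Callable[[str, str], str],
                state: str,
                max_steps: int = 10) -> List[str]:
    """Greedy loop: score every skill, run the best one, re-query with history."""
    history: List[str] = []
    for _ in range(max_steps):
        llm_probs = llm_score(instruction, history, skills)
        values = {skill: affordance(state, skill) for skill in skills}
        best, _ = select_skill(llm_probs, values)
        if best == DONE:
            break
        state = execute(state, best)
        history.append(best)
    return history


# --- Toy stand-ins (all numbers are made up for illustration) ---

def toy_llm(instruction: str, history: List[str], skills: List[str]) -> Dict[str, float]:
    """Fake LLM scores indexed by how many steps have been taken so far."""
    table = [
        [0.30, 0.50, 0.10, 0.10],  # ungrounded LLM wants to pick up first
        [0.10, 0.60, 0.20, 0.10],
        [0.10, 0.10, 0.70, 0.10],
        [0.05, 0.05, 0.10, 0.80],
    ]
    row = table[min(len(history), len(table) - 1)]
    return dict(zip(skills, row))


def toy_affordance(state: str, skill: str) -> float:
    """Fake value function: high only when the skill's precondition holds."""
    if skill == DONE:
        return 1.0  # assumption: terminating is always feasible
    if skill == "find a sponge":
        return 0.9
    required = {"pick up the sponge": "near_sponge",
                "bring it to the user": "holding_sponge"}
    return 0.9 if state == required[skill] else 0.1


def toy_execute(state: str, skill: str) -> str:
    """Fake executor: every skill succeeds and moves to a fixed next state."""
    return {"find a sponge": "near_sponge",
            "pick up the sponge": "holding_sponge",
            "bring it to the user": "delivered"}[skill]


if __name__ == "__main__":
    plan = saycan_plan("clean up the spill", SKILLS, toy_llm,
                       toy_affordance, toy_execute, state="start")
    print(plan)  # ['find a sponge', 'pick up the sponge', 'bring it to the user']
```

At the first step the toy LLM scores "pick up the sponge" highest (0.5), but its affordance from the start state is only 0.1, so the combined score (0.05) loses to "find a sponge" (0.3 × 0.9 = 0.27). That is the grounding effect the method relies on.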

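Experimental results

Research questions

RQ1: Can an embodied agent execute high-level natural language instructions by grounding LLM knowledge in real-world affordances?
RQ2: Does combining LLM-guided planning with skill affordances improve planning and execution on a real robot?
RQ3: How well does the approach scale to long-horizon, abstract tasks in a kitchen environment?
RQ4: How do different language models and grounding components affect performance?
RQ5: What is the effect of adding new skills to the system?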

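Key findings

PaLM-SayCan achieves an 84% planning success rate and a 74% execution success rate in the mock kitchen.
In the real kitchen, planning success drops to 81% and execution success to 60%, which indicates reasonable generalization to a real-world setting.
Combining affordance grounding with LLM guidance nearly doubles performance compared with non-grounded baselines.
Larger LLMs improve performance: PaLM (540B) outperforms FLAN in both planning and execution for the overall system.
Ablations show that both language grounding and affordance grounding are necessary for high performance.
The system can readily integrate new skills (e.g., drawer manipulation) while maintaining performance on existing tasks.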


This review was generated by AI and checked by a human editor.