QUICK REVIEW

[Paper Review] Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents

Wenlong Huang, Fei Xia|arXiv (Cornell University)|Mar 1, 2023

Topic Modeling | Citations: 13

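Summary in Brief

Grounded Decoding (GD) couples a frozen large language model (LLM) with domain-specific grounding models to decode open-vocabulary plans for embodied robots. It enables long-horizon tasks without fine-tuning the LLM. GD scores candidates with token-level joint probabilities so that the decoded plan is both semantically plausible and physically feasible.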

ABSTRACT

Recent progress in large language models (LLMs) has demonstrated the ability to learn and leverage Internet-scale knowledge through pre-training with autoregressive models. Unfortunately, applying such models to settings with embodied agents, such as robots, is challenging due to their lack of experience with the physical world, inability to parse non-language observations, and ignorance of rewards or safety constraints that robots may require. On the other hand, language-conditioned robotic policies that learn from interaction data can provide the necessary grounding that allows the agent to be correctly situated in the real world, but such policies are limited by the lack of high-level semantic understanding due to the limited breadth of the interaction data available for training them. Thus, if we want to make use of the semantic knowledge in a language model while still situating it in an embodied setting, we must construct an action sequence that is both likely according to the language model and also realizable according to grounded models of the environment. We frame this as a problem similar to probabilistic filtering: decode a sequence that both has high probability under the language model and high probability under a set of grounded model objectives. We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and that the proposed decoding strategy is able to solve complex, long-horizon embodiment tasks in a robotic setting by leveraging the knowledge of both models. The project's website can be found at grounded-decoding.github.io.

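Motivation and Objectives

Bridge the high-level semantic planning of large language models with grounding information from the robot's embodiment and environment.
Develop a token-level decoding strategy that combines LLM probabilities with grounded model objectives (affordances, safety, preferences).
Show applicability across multiple domains (simulation and real world) and demonstrate efficiency gains over prior methods.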

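Proposed Method

Define a grounding function $p_G(w_{1:n} \mid s)$ conditioned on the embodiment state $s$.
Formulate GD as token-level autoregressive decoding that maximizes $p_{GD}(w_{1:N} \mid s, \ell)$ for a language instruction $\ell$: $p_{GD}(w_{1:N} \mid s, \ell) \propto \prod_{n=1}^{N} p_{LLM}(w_n \mid w_{1:n-1}, \ell) \cdot p_G(w_{1:n} \mid s)$.
Implement GD with greedy decoding or beam search, selecting tokens that are likely under both the LLM and the grounded models.
Obtain grounding signals from domain data (token-conditioned value functions, multimodal detectors, rule-based cues) and combine them (affordances, safety, preferences).
Optionally enable multimodal grounding through prompting or chain-of-thought techniques, so that vision-language models can be used at decoding time.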
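
The greedy variant of the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the vocabulary, `toy_llm_probs`, the `affordance` and `safety` grounders, and every probability value are hypothetical stand-ins for a real LLM and learned grounded models. Multiplying several grounding scores together is one simple way to combine signals.

```python
import math

# Hypothetical toy vocabulary; a real system would use the LLM's tokenizer.
VOCAB = ["pick", "place", "apple", "knife", "<eos>"]


def toy_llm_probs(prefix):
    """Stand-in for p_LLM(w_n | w_1:n-1, l); the numbers are invented."""
    if not prefix:
        return {"pick": 0.6, "place": 0.3, "apple": 0.05, "knife": 0.04, "<eos>": 0.01}
    if prefix[-1] in ("pick", "place"):
        return {"pick": 0.01, "place": 0.01, "apple": 0.4, "knife": 0.55, "<eos>": 0.03}
    return {"pick": 0.02, "place": 0.02, "apple": 0.03, "knife": 0.03, "<eos>": 0.9}


def affordance(tokens, state):
    """Toy grounding score: near zero for objects not visible in the scene."""
    last = tokens[-1]
    if last in ("apple", "knife") and last not in state["visible"]:
        return 1e-6
    return 1.0


def safety(tokens, state):
    """Toy rule-based grounding score: penalize objects flagged as unsafe."""
    return 0.01 if tokens[-1] in state["unsafe"] else 1.0


def grounded_greedy_decode(state, grounders, max_len=5):
    """At each step pick the token maximizing p_LLM * product of grounding scores."""
    tokens, log_score = [], 0.0
    for _ in range(max_len):
        llm = toy_llm_probs(tokens)
        best_tok, best = None, -math.inf
        for tok in VOCAB:
            candidate = tokens + [tok]
            grounding = 1.0
            for fn in grounders:
                grounding *= fn(candidate, state)
            score = math.log(llm[tok]) + math.log(grounding)
            if score > best:
                best_tok, best = tok, score
        tokens.append(best_tok)
        log_score += best
        if best_tok == "<eos>":
            break
    return tokens, log_score


state = {"visible": {"apple", "knife"}, "unsafe": {"knife"}}
grounded_plan, _ = grounded_greedy_decode(state, [affordance, safety])
ungrounded_plan, _ = grounded_greedy_decode(state, [])
print(grounded_plan)    # plan decoded with grounding
print(ungrounded_plan)  # plan decoded from the toy LLM alone
```

In this toy setup the ungrounded decoder follows the LLM's preference for "knife", while the safety grounder steers the grounded decoder toward "apple". The LLM itself is never modified, only the per-token scores used for selection.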

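Experimental Results

Research Questions

RQ1: How can an open-vocabulary language model be grounded in a robot's embodied state to generate executable long-horizon plans?
RQ2: Which grounding signals (affordances, safety, preferences, multimodal detectors) improve long-horizon task success compared with ungrounded decoding?
RQ3: Can token-level Grounded Decoding scale efficiently to open action spaces and multiple domains (simulation and real world) without fine-tuning the LLM?
RQ4: How does GD compare with SayCan in planning and execution efficiency and in generalization to unseen tasks?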

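Key Findings

GD achieves strong performance across three embodiment domains by integrating LLM planning with domain grounding.
Beam search improves performance over greedy decoding, most notably on long-horizon tasks.
GD is two orders of magnitude more efficient than SayCan while achieving comparable performance.
Grounding through affordances, safety, and preferences narrows the action space and reduces planning failures compared with ungrounded or purely LLM-based approaches.
Combining chain-of-thought prompting with multimodal grounding helps resolve task ambiguity in real-world settings and improves planning and execution.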
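
Why beam search can beat greedy decoding is easiest to see on a toy example. The sketch below is illustrative only: the `JOINT` table of per-step joint scores (LLM probability times grounding score) and the action names are invented for this example and do not come from the paper's experiments.

```python
import math

# Hypothetical per-step joint scores p_LLM * p_G, keyed by the prefix so far.
JOINT = {
    (): {"go_to_drawer": 0.6, "go_to_table": 0.4},
    ("go_to_drawer",): {"pick_cup": 0.3, "pick_sponge": 0.3},
    ("go_to_table",): {"pick_cup": 0.9, "pick_sponge": 0.1},
}


def beam_search(joint, beam_width, length):
    """Keep the beam_width best prefixes by accumulated log joint score."""
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, logp in beams:
            for tok, p in joint[prefix].items():
                candidates.append((prefix + (tok,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


greedy_plan, greedy_logp = beam_search(JOINT, beam_width=1, length=2)
beam_plan, beam_logp = beam_search(JOINT, beam_width=2, length=2)
print(greedy_plan, math.exp(greedy_logp))
print(beam_plan, math.exp(beam_logp))
```

With a beam width of 1 (greedy), the decoder commits to the locally best first step and ends with a lower joint score. A width of 2 keeps the second-best first step alive and recovers the higher-scoring two-step plan. Longer horizons offer more opportunities for this kind of early commitment error.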

This review was generated by AI and checked by a human editor.