QUICK REVIEW

[Paper Review] Embodied Task Planning with Large Language Models

Zhenyu Wu, Ziwei Wang | arXiv (Cornell University) | Jul 4, 2023

Multimodal Machine Learning Applications | Cited by 18

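One-line summary

TaPA grounds executable task plans from an LLM in real-world indoor scenes by using multi-view visual perception and open-vocabulary object detection, and it outperforms the baselines on complex embodied tasks.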

ABSTRACT

Equipping embodied agents with commonsense is important for robots to successfully complete complex human instructions in general environments. Recent large language models (LLM) can embed rich semantic knowledge for agents in plan generation of complex tasks, while they lack the information about the realistic world and usually yield infeasible action sequences. In this paper, we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning with physical scene constraint, where the agent generates executable plans according to the existed objects in the scene by aligning LLMs with the visual perception models. Specifically, we first construct a multimodal dataset containing triplets of indoor scenes, instructions and action plans, where we provide the designed prompts and the list of existing objects in the scene for GPT-3.5 to generate a large number of instructions and corresponding planned actions. The generated data is leveraged for grounded plan tuning of pre-trained LLMs. During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations. Experimental results show that the generated plan from our TaPA framework can achieve higher success rate than LLaVA and GPT-3.5 by a sizable margin, which indicates the practicality of embodied task planning in general and complex environments.

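Motivation and objectives

Embodied agents need commonsense knowledge to carry out complex human instructions in realistic environments.
Develop a grounded planning approach that stays consistent with the perceived scene and generates executable action sequences.
Build a large-scale multimodal dataset of scenes, instructions, and executable plans for fine-tuning an LLM-based task planner.
Evaluate the TaPA framework against state-of-the-art LLMs and LMMs across diverse indoor rooms to demonstrate its practicality.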

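Proposed method

Construct a multimodal dataset of (scene, instruction, executable plan) triplets with GPT-3.5, using the scene object list in the prompt.
Fine-tune a pre-trained LLaMA model to predict action steps from the object list and instruction samples.
Extend open-vocabulary object detectors to multi-view RGB images to obtain a robust scene object list at inference time.
Combine the predicted object list with the human instruction to generate an executable action sequence.
Evaluate multi-view and grid-based image collection strategies to balance information richness against object-detection noise.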
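
The inference-time steps above (merge detections from several views into one object list, then combine that list with the instruction) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `(label, score)` detection format, the `min_score` threshold, the lowercase label normalization, and the prompt wording are all assumptions made for the example, and no real detector or LLM is called.

```python
from collections import defaultdict


def merge_detections(per_view_detections, min_score=0.5):
    """Merge per-view (label, score) pairs into one deduplicated object list.

    Hypothetical format: each view is a list of (label, score) tuples, as an
    open-vocabulary detector might return. Labels are lowercased so that
    "Mug" and "mug" seen from different views count as the same object.
    """
    best = defaultdict(float)
    for view in per_view_detections:
        for label, score in view:
            name = label.strip().lower()
            # Drop low-confidence detections to limit detector noise.
            if score >= min_score:
                best[name] = max(best[name], score)
    return sorted(best)


def build_prompt(instruction, object_list):
    """Combine the object list and the instruction into one planner prompt.

    The wording is illustrative only; it is not the prompt used in the paper.
    """
    objects = ", ".join(object_list)
    return (
        f"Objects in the scene: [{objects}]\n"
        f"Instruction: {instruction}\n"
        "Generate a step-by-step action plan using only the listed objects."
    )


views = [
    [("Mug", 0.9), ("table", 0.4)],   # view 1: "table" falls below threshold
    [("mug", 0.7), ("Sofa", 0.8)],    # view 2: "mug" duplicates view 1
]
scene_objects = merge_detections(views)
prompt = build_prompt("Bring me a cup of coffee.", scene_objects)
```

Restricting the prompt to detected objects is what ties the plan to the scene: the planner is asked to act only on things the perception step actually reported.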

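Experimental results

Research questions

RQ1: Can an LLM-based planner generate executable action sequences that respect real-world scene constraints?
RQ2: How do multimodal data generation and grounded perception affect plan validity and success rate across different rooms?
RQ3: How do multi-view object detection and image collection strategies affect grounded task planning performance?

Key findings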

Success rate (%) by room type

Method	Kitchen	Living room	Bedroom	Bathroom	Average
LLaVA	14.29	42.11	33.33	0.00	22.43
GPT-3.5	28.57	73.68	66.67	50.00	54.73
LLaMA	0.00	10.52	13.33	0.00	5.96
TaPA	28.57	84.21	73.33	58.33	61.11
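
The last column of the table is consistent with an unweighted mean of the four per-room success rates. The short check below recomputes it from the table values; treating the average as a simple mean over rooms is an inference from the numbers, not something the review states explicitly.

```python
# Per-room success rates (%) copied from the table:
# kitchen, living room, bedroom, bathroom.
success_rates = {
    "LLaVA": [14.29, 42.11, 33.33, 0.00],
    "GPT-3.5": [28.57, 73.68, 66.67, 50.00],
    "LLaMA": [0.00, 10.52, 13.33, 0.00],
    "TaPA": [28.57, 84.21, 73.33, 58.33],
}


def mean_rate(rates):
    """Unweighted mean over room types."""
    return sum(rates) / len(rates)


averages = {method: mean_rate(rates) for method, rates in success_rates.items()}

# Gap between TaPA and the strongest baseline, in percentage points.
gap_vs_gpt35 = averages["TaPA"] - averages["GPT-3.5"]

for method, avg in averages.items():
    print(f"{method}: {avg:.2f}")
```

Each recomputed mean agrees with the reported average to within rounding, and the gap between TaPA and GPT-3.5 comes to about 6.4 percentage points.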

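TaPA achieves the highest average success rate across kitchen, living room, bedroom, and bathroom scenes, ahead of LLaVA, GPT-3.5, and LLaMA; in the kitchen it ties GPT-3.5 at 28.57%.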
TaPA average success rate: 61.11% versus 54.73% (GPT-3.5) in the reported evaluation.
TaPA shows lower hallucination and counterfactual rates than competing models when generating executable plans.
Block-wise center point image collection gives the best balance between scene coverage and redundancy, improving planning success.
Open-vocabulary detection combined with finetuned instruction following yields more executable plans than single-image baselines like LLaVA.
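
One way to picture block-wise center point image collection is to split the room's reachable area into a grid and capture images only from the location nearest each block's center. The sketch below illustrates that idea under stated assumptions: 2D `(x, y)` achievable locations, a square grid of `blocks_per_side` blocks over their bounding box, and nearest-to-center selection. The function name and these details are hypothetical and may differ from the procedure used in the paper.

```python
import math


def blockwise_center_points(locations, blocks_per_side=2):
    """Pick, for each occupied grid block, the location nearest its center.

    `locations` is a non-empty list of (x, y) points the agent can reach.
    Blocks that contain no reachable location are skipped.
    """
    xs = [x for x, _ in locations]
    ys = [y for _, y in locations]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    # Fall back to a unit step when all points share a coordinate.
    step_x = (max_x - min_x) / blocks_per_side or 1.0
    step_y = (max_y - min_y) / blocks_per_side or 1.0

    # Assign every location to its grid block.
    blocks = {}
    for x, y in locations:
        i = min(int((x - min_x) / step_x), blocks_per_side - 1)
        j = min(int((y - min_y) / step_y), blocks_per_side - 1)
        blocks.setdefault((i, j), []).append((x, y))

    # Keep one capture point per block: the member closest to the center.
    chosen = []
    for (i, j), members in sorted(blocks.items()):
        center = (min_x + (i + 0.5) * step_x, min_y + (j + 0.5) * step_y)
        chosen.append(min(members, key=lambda p: math.dist(p, center)))
    return chosen


# A 4 x 4 lattice of reachable positions, reduced to one point per block.
reachable = [(x, y) for x in range(4) for y in range(4)]
capture_points = blockwise_center_points(reachable, blocks_per_side=2)
```

Here 16 candidate positions are reduced to 4 capture points, one per block. That is the trade-off the finding describes: every region of the room is still covered, with fewer overlapping views for the detector to process.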

This review was generated by AI and checked by a human editor.