QUICK REVIEW

[Paper Review] RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

Tomoyuki Kagaya, Thong Jing Yuan|arXiv (Cornell University)|Feb 6, 2024

Multi-Agent Systems and Negotiation | Citations: 5

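One-Line Summary

RAP stores past experiences and retrieves them dynamically to guide planning for both text-only and multimodal LLM agents. It achieves state-of-the-art performance on text tasks and strong improvements on embodied multimodal tasks.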

ABSTRACT

Owing to recent advancements, Large Language Models (LLMs) can now be deployed as agents for increasingly complex decision-making applications in areas including robotics, gaming, and API integration. However, reflecting past experiences in current decision-making processes, an innate human behavior, continues to pose significant challenges. Addressing this, we propose Retrieval-Augmented Planning (RAP) framework, designed to dynamically leverage past experiences corresponding to the current situation and context, thereby enhancing agents' planning capabilities. RAP distinguishes itself by being versatile: it excels in both text-only and multimodal environments, making it suitable for a wide range of tasks. Empirical evaluations demonstrate RAP's effectiveness, where it achieves SOTA performance in textual scenarios and notably enhances multimodal LLM agents' performance for embodied tasks. These results highlight RAP's potential in advancing the functionality and applicability of LLM agents in complex, real-world applications.

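Motivation and Objectives

Motivates the need to draw on past experiences when LLM agents plan, in both text and multimodal environments.
Proposes the Retrieval-Augmented Planning (RAP) framework, which stores, retrieves, and uses past experiences to inform current decisions.
Demonstrates RAP's usefulness on text-based benchmarks (e.g., ALFWorld, WebShop) and embodied robotics benchmarks (e.g., Franka Kitchen, Meta-World).
Shows that memory-augmented planning improves performance across multiple LLM backbones and vision-language models.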

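Proposed Method

Introduces four core RAP components: Memory, Reasoner, Retriever, and Executor.
Stores episodic logs of successful task executions as memory, including the task information, the overall plan, and the trajectory.
Uses the Reasoner (an LLM) to generate the overall plan, action plans, and retrieval keys from the current context.
Computes a retrieval score as a weighted combination of task similarity, plan alignment, and retrieval-key similarity, and uses it to select relevant memories.
Uses the Executor (an LLM) to generate the next action via in-context learning, with the retrieved experiences supplied in the prompt.
Demonstrates that memory transfers across models (memory built with one model helps the evaluation of another).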
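
The retrieval and prompting steps listed above can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Experience` field names, the use of cosine similarity for all three terms (including plan alignment), the equal weights, the hand-written toy vectors standing in for real embeddings, and the prompt layout.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Experience:
    """One logged successful episode. Field names are hypothetical."""
    task: str
    plan: str
    trajectory: list = field(default_factory=list)
    task_vec: tuple = ()   # toy embedding of the task description
    plan_vec: tuple = ()   # toy embedding of the overall plan
    key_vec: tuple = ()    # toy embedding of the retrieval key


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieval_score(query, exp, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted combination of task, plan, and retrieval-key similarity.
    The equal weights are an assumption, not values from the paper."""
    w_task, w_plan, w_key = weights
    return (w_task * cosine(query.task_vec, exp.task_vec)
            + w_plan * cosine(query.plan_vec, exp.plan_vec)
            + w_key * cosine(query.key_vec, exp.key_vec))


def retrieve(query, memory, k=1, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Return the k stored experiences with the highest retrieval score."""
    ranked = sorted(memory,
                    key=lambda e: retrieval_score(query, e, weights),
                    reverse=True)
    return ranked[:k]


def build_prompt(query, retrieved):
    """Place retrieved episodes before the current task as in-context examples."""
    lines = []
    for exp in retrieved:
        lines.append(f"Task: {exp.task}")
        lines.append(f"Plan: {exp.plan}")
        lines.extend(exp.trajectory)
        lines.append("")
    lines.append(f"Task: {query.task}")
    lines.append(f"Plan: {query.plan}")
    return "\n".join(lines)


# Toy memory with two episodes; all vectors are made up for illustration.
memory = [
    Experience("heat an egg", "find egg, heat it",
               ["> go to fridge", "> take egg"],
               (1.0, 0.0), (1.0, 0.0), (1.0, 0.0)),
    Experience("clean a mug", "find mug, clean it",
               ["> go to sink"],
               (0.0, 1.0), (0.0, 1.0), (0.0, 1.0)),
]
query = Experience("heat a potato", "find potato, heat it", [],
                   (0.9, 0.1), (1.0, 0.0), (0.8, 0.2))

top = retrieve(query, memory, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In this toy setup the "heat an egg" episode ranks first, so its trajectory is placed ahead of the current task in the prompt that an Executor LLM would receive.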

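Experimental Results

Research Questions

RQ1: How can past experiences be stored and retrieved effectively to improve LLM agent planning in text and multimodal environments?
RQ2: Does a memory-augmented planner outperform state-of-the-art baselines on text-based benchmarks and embodied robotics benchmarks?
RQ3: Is RAP robust across different language models and vision-language models, and can memory be transferred between models?
RQ4: Which retrieval strategy (act, obs, multimodal) performs best in each environment?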

Key Findings

Method (d max = 3)	Model	Pick	Clean	Heat	Cool	Look	Pick2	All
Act	GPT-3.5	66.7	51.6	73.9	61.9	38.9	17.6	53.7
ReAct	GPT-3.5	50.0	41.9	73.9	66.7	55.6	23.5	52.2
Reflexion	GPT-3.5	75.0	77.4	65.2	76.2	83.3	70.6	74.6
ADaPT *	GPT-3.5	87.5	80.6	60.8	76.2	61.1	52.9	71.6
RAP(Ours)	GPT-3.5	95.8	87.1	78.3	90.5	88.9	70.6	85.8
RAP train (Ours)	GPT-3.5	95.8	100.0	82.6	85.7	100.0	76.5	91.0
ReAct	GPT-4	83.3	71.0	95.7	81.0	100.0	94.1	85.8
RAP(Ours)	GPT-4	95.8	90.3	100.0	95.2	100.0	88.2	94.8
ReAct	Llama2-13b	29.2	41.9	34.8	52.4	38.9	17.6	36.6
RAP(Ours)	Llama2-13b	62.5	61.3	56.5	61.9	44.4	17.6	53.0

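RAP substantially outperforms ReAct across the ALFWorld, WebShop, Franka Kitchen, and Meta-World benchmarks (improvements of roughly 33.6%, 13.0%, 18.2%, and 12.7%, respectively).
On ALFWorld, RAP with GPT-3.5 reaches 85.8% overall and RAP train reaches 91.0% overall, outperforming ReAct, Reflexion, and ADaPT.
On WebShop, RAP with GPT-3.5 achieves an overall score of 76.1% and a success rate of 48.0%, higher than ReAct, Reflexion, and ADaPT.
On the multimodal benchmarks (Franka Kitchen, Meta-World), RAP-augmented LLaVA and CogVLM agents show marked improvements in average success rate (e.g., LLaVA from 43.4% to 61.6%, CogVLM from 44.2% to 56.9%).
RAP demonstrates memory-based transfer across models (memory built with GPT-3.5 helps evaluation with Llama2-13b).
Ablations of RAP show the benefit of multimodal retrieval keys (images) and of the intra-task and product-category retrieval components.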
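
As a quick check on the ALFWorld numbers, the short Python sketch below recomputes the gap between RAP and ReAct from the overall success rates in the table above. The figures are copied from that table; the dictionary layout and the `gain` helper are hypothetical. The GPT-3.5 gap comes out to 33.6, which suggests that the ALFWorld improvement quoted above is an absolute difference in percentage points rather than a relative gain.

```python
# Overall ALFWorld success rates (%) copied from the table above.
overall = {
    ("ReAct", "GPT-3.5"): 52.2,
    ("RAP", "GPT-3.5"): 85.8,
    ("ReAct", "GPT-4"): 85.8,
    ("RAP", "GPT-4"): 94.8,
    ("ReAct", "Llama2-13b"): 36.6,
    ("RAP", "Llama2-13b"): 53.0,
}


def gain(model):
    """Absolute gain of RAP over ReAct in percentage points (hypothetical helper)."""
    return round(overall[("RAP", model)] - overall[("ReAct", model)], 1)


for model in ("GPT-3.5", "GPT-4", "Llama2-13b"):
    print(f"{model}: +{gain(model)} points")
```

The same subtraction gives gaps of 9.0 points with GPT-4 and 16.4 points with Llama2-13b.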


This review was generated by AI and checked by a human editor.