QUICK REVIEW

[Paper Review] On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark)

Karthik Valmeekam, Sarath Sreedharan|arXiv (Cornell University)|Feb 13, 2023

Natural Language Processing Techniques|Cited by 31

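One-Sentence Summary

This paper presents a benchmark that systematically evaluates LLMs on Blocksworld-style tasks in three modes: autonomous planning, heuristic guidance, and human-in-the-loop. It finds that autonomous planning is largely ineffective (about a 3% success rate), whereas in certain modes a planner can repair or otherwise make use of the LLM's suggestions.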

ABSTRACT

Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) how good LLMs are by themselves in generating and validating simple plans in commonsense planning tasks (of the type that humans are generally quite good at) and (2) how good LLMs are in being a source of heuristic guidance for other agents--either AI planners or human planners--in their planning tasks. To investigate these questions in a systematic rather than anecdotal manner, we start by developing a benchmark suite based on the kinds of domains employed in the International Planning Competition. On this benchmark, we evaluate LLMs in three modes: autonomous, heuristic and human-in-the-loop. Our results show that LLM's ability to autonomously generate executable plans is quite meager, averaging only about 3% success rate. The heuristic and human-in-the-loop modes show slightly more promise. In addition to these results, we also make our benchmark and evaluation tools available to support investigations by research community.

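Motivation and Objectives

Evaluate whether LLMs can generate and validate executable plans for commonsense planning tasks without external assistance.
Evaluate whether LLMs can provide useful heuristic guidance to other planners.
Evaluate the benefit to human-in-the-loop planning when LLM-generated plans or suggestions are used.
Provide an automated, public benchmark and evaluation tools for reproducible planning research.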

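Proposed Method

Develop a benchmark suite modeled on International Planning Competition domains to test plan generation and validation.
Evaluate LLMs in three modes: autonomous, heuristic, and human-in-the-loop.
Use PDDL-style domain modeling and a template-based natural language translator to connect symbolic plans with text prompts.
Run automated evaluation that measures executability and plan quality using an automated planner (LPG) and a plan validation tool.
Instantiate the test cases in Blocksworld and analyze planner performance using standard metrics (correctness, optimality, etc.).
Release the benchmark and tools for research use.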
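The translation and validation steps above can be illustrated with a minimal sketch. It assumes a simplified four-action Blocksworld and made-up sentence templates; the names `TEMPLATES`, `to_text`, `step`, and `validate` are hypothetical and are not taken from the benchmark's code.

```python
# Hypothetical sentence templates; the benchmark's real wording is not
# given in this review.
TEMPLATES = {
    "pick-up": "pick up the {0} block",
    "put-down": "put down the {0} block",
    "stack": "stack the {0} block on top of the {1} block",
    "unstack": "unstack the {0} block from on top of the {1} block",
}
ARITY = {"pick-up": 1, "put-down": 1, "stack": 2, "unstack": 2}


def to_text(action):
    """Render a symbolic action such as ("stack", "red", "blue") as text."""
    name, *args = action
    return TEMPLATES[name].format(*args)


def is_clear(on, block):
    """A block is clear when no other block rests on it."""
    return block not in on.values()


def step(on, holding, action):
    """Apply one action; return (on, holding) or None if it is not applicable.

    `on` maps each placed block to its support ("table" or another block);
    `holding` is the block in the hand, or None.
    """
    name, *args = action
    if ARITY.get(name) != len(args):
        return None  # unknown action or wrong number of arguments
    on = dict(on)  # work on a copy so the caller's state is untouched
    if name == "pick-up":
        b = args[0]
        if holding is None and on.get(b) == "table" and is_clear(on, b):
            del on[b]
            return on, b
    elif name == "put-down":
        b = args[0]
        if holding == b:
            on[b] = "table"
            return on, None
    elif name == "unstack":
        b, below = args
        if (holding is None and below != "table"
                and on.get(b) == below and is_clear(on, b)):
            del on[b]
            return on, b
    elif name == "stack":
        b, below = args
        if holding == b and below in on and is_clear(on, below):
            on[b] = below
            return on, None
    return None


def validate(initial_on, goal_on, plan):
    """True if every action is applicable in turn and the goal holds at the end."""
    on, holding = dict(initial_on), None
    for action in plan:
        result = step(on, holding, action)
        if result is None:
            return False
        on, holding = result
    return all(on.get(b) == support for b, support in goal_on.items())


initial = {"red": "table", "blue": "red"}   # blue sits on red
goal = {"red": "blue"}                      # want red on blue
plan = [("unstack", "blue", "red"), ("put-down", "blue"),
        ("pick-up", "red"), ("stack", "red", "blue")]

print([to_text(a) for a in plan])
print(validate(initial, goal, plan))               # True
print(validate(initial, goal, [("pick-up", "red")]))  # False: red is not clear
```

A plan counts as correct here only if it is executable step by step and reaches the goal, which mirrors the executability check described in the list above at a toy scale.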

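Experimental Results

Research Questions

RQ1: Can LLMs autonomously generate executable plans in commonsense planning domains?
RQ2: Can LLMs improve planning tasks when used as a source of heuristic guidance for other planners?
RQ3: Do LLM-generated plans help or hinder human planners in solving problems?
RQ4: How do goal reformulation, plan reuse, and replanning affect LLM-assisted planning?

Key Findings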

Task	Instances correct (GPT-3)	Instances correct (Instruct-GPT3)	Instances correct (BLOOM)
Plan Generation	6/600 (1%)	41/600 (6.8%)	4/250 (1.6%)
Optimal Planning	2/600 (0.3%)	35/600 (5.8%)	3/150 (2%)
Replanning	47/600 (7.8%)	40/600 (6.6%)	3/100 (3%)
Plan Generalization	33/500 (6.6%)	49/500 (9.8%)	11/100 (11%)
Plan Reuse	0/600 (0%)	102/600 (17%)	0/100 (0%)
Robustness to Goal Reformulation (Shuffling)	460/600 (76.6%)	467/600 (77.8%)	21/100 (21%)
Robustness to Goal Reformulation (Full → Partial)	407/600 (67.8%)	467/600 (77.8%)	9/100 (9%)
Robustness to Goal Reformulation (Partial → Full)	122/600 (20.3%)	363/600 (60.5%)	5/100 (5%)
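As a worked check on the roughly 3% figure, the sketch below recomputes the Plan Generation row from the table above in two ways: as an unweighted mean of the three per-model rates and as a pooled rate over all instances. Which of these (if either) the authors used is an assumption; the review does not say.

```python
# (correct, total) for the three model columns of the Plan Generation row,
# left to right, copied from the table above.
PLAN_GENERATION = [(6, 600), (41, 600), (4, 250)]


def rate(correct, total):
    """Success rate as a percentage."""
    return 100.0 * correct / total


def macro_average(rows):
    """Unweighted mean of the per-model success rates."""
    return sum(rate(c, t) for c, t in rows) / len(rows)


def pooled_rate(rows):
    """Success rate over all instances pooled across models."""
    return 100.0 * sum(c for c, _ in rows) / sum(t for _, t in rows)


print(f"macro average: {macro_average(PLAN_GENERATION):.1f}%")  # 3.1%
print(f"pooled rate:   {pooled_rate(PLAN_GENERATION):.1f}%")    # 3.5%
```

Both readings land near 3%, so the headline number is consistent with the Plan Generation row under either assumption.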

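LLMs have a very low success rate in autonomous planning: on average only about 3% of generated plans are executable.
In heuristic mode, plans proposed by the LLM can be repaired into correct plans by an automated planner (LPG) with relatively little effort.
Human-in-the-loop planning with LLM suggestions did not yield a statistically significant reduction in time or cognitive load, but it showed limited improvement.
LLMs do noticeably better in heuristic mode and on certain goal reformulation tasks, but their overall autonomous planning ability remains limited.
The human baseline on Blocksworld tasks shows that humans can produce valid and often optimal plans, outperforming LLMs in autonomous generation.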
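This review does not state how the effort of repairing an LLM plan was measured. As one hypothetical proxy, the sketch below computes the Levenshtein (edit) distance between an LLM-proposed action sequence and a repaired one; the function name `plan_edit_distance` and the example plans are illustrative assumptions, not data from the paper.

```python
def plan_edit_distance(a, b):
    """Levenshtein distance between two action sequences.

    Counts the minimum number of action insertions, deletions, and
    substitutions needed to turn plan `a` into plan `b`.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


# Illustrative plans only (not taken from the benchmark).
llm_plan = [("pick-up", "red"), ("stack", "red", "blue")]
repaired = [("unstack", "blue", "red"), ("put-down", "blue"),
            ("pick-up", "red"), ("stack", "red", "blue")]

print(plan_edit_distance(llm_plan, repaired))  # 2: two actions were inserted
```

Under this proxy, a small distance relative to the repaired plan's length would indicate that the LLM's suggestion was close to a correct plan.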


This review was generated by AI and checked by a human editor.