QUICK REVIEW

[Paper Review] PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change

Karthik Valmeekam, Marquez, Matthew|arXiv (Cornell University)|Jun 21, 2022

Natural Language Processing Techniques | Citations: 49

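One-line summary

PlanBench is an extensible benchmark suite that uses IPC-style planning domains (Blocksworld and Logistics) to evaluate planning and reasoning about change. The results show that current LLMs such as GPT-4 and InstructGPT-3 struggle on many planning tasks.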

ABSTRACT

Generating plans of action, and reasoning about change have long been considered a core competence of intelligent agents. It is thus no surprise that evaluating the planning and reasoning capabilities of large language models (LLMs) has become a hot topic of research. Most claims about LLM planning capabilities are, however, based on common sense tasks, where it becomes hard to tell whether LLMs are planning or merely retrieving from their vast world knowledge. There is a strong need for systematic and extensible planning benchmarks with sufficient diversity to evaluate whether LLMs have innate planning capabilities. Motivated by this, we propose PlanBench, an extensible benchmark suite based on the kinds of domains used in the automated planning community, especially in the International Planning Competition, to test the capabilities of LLMs in planning or reasoning about actions and change. PlanBench provides sufficient diversity in both the task domains and the specific planning capabilities. Our studies also show that on many critical capabilities, including plan generation, LLM performance falls quite short, even with the SOTA models. PlanBench can thus function as a useful marker of progress of LLMs in planning and reasoning.

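Motivation and objectives

Introduce PlanBench, an extensible benchmark for evaluating LLM planning and reasoning across diverse domains.
Ground the benchmark in a classical planning formalism (PDDL) so that LLM outputs can be constrained and verified.
Provide a curriculum of tasks that assess distinct planning capabilities: plan generation, cost-optimal planning, plan verification, reasoning about plan execution, replanning, generalization, reuse, and robustness.
Present a domain-independent evaluation framework for prompt generation and plan validation, together with the domain-dependent components (domain model, problem generation, translation).
Release a dataset of roughly 26,250 prompts, including obfuscated domains for testing robustness to domain naming.
Report baseline results for state-of-the-art LLMs to quantify current capabilities and guide future improvement.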

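Proposed method

Define a domain-independent plan evaluation framework built on a symbolic planner and a plan validator.
Use domain-dependent components: a lifted domain model, a problem generator, and translators between natural language and the formal representation.
Translate PDDL-style problems into natural-language prompts, then parse the LLM output back into a plan representation for validation.
Build eight test cases across the IPC domains (plan generation plus seven further planning and reasoning tasks), using few-shot prompts and an end-of-plan marker to extract the plan.
Shuffle the domains to test whether performance rests on surface naming or on the underlying reasoning patterns.
Provide the tools, datasets, and reproduction scripts for the prompts and results in a public repository.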
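
The parse-then-validate step described above can be illustrated with a minimal sketch. It is not PlanBench's actual code: the marker string `[PLAN END]`, the function names `parse_plan`, `ground`, and `validate_plan`, and the tuple-based state encoding are all assumptions made for illustration, and the action definitions follow the standard four-operator Blocksworld domain rather than any file from the benchmark repository.

```python
import re

# Hypothetical end-of-plan marker; the marker PlanBench actually uses may differ.
PLAN_END = "[PLAN END]"


def parse_plan(llm_output):
    """Extract parenthesised steps such as "(unstack b a)" that appear before the marker."""
    text = llm_output.split(PLAN_END, 1)[0].lower()
    return [tuple(step.split()) for step in re.findall(r"\(([^()]+)\)", text)]


def ground(action, *args):
    """Return (preconditions, add effects, delete effects) for a Blocksworld action."""
    if action == "pick-up" and len(args) == 1:
        (x,) = args
        pre = {("clear", x), ("ontable", x), ("handempty",)}
        return pre, {("holding", x)}, pre
    if action == "put-down" and len(args) == 1:
        (x,) = args
        add = {("clear", x), ("ontable", x), ("handempty",)}
        return {("holding", x)}, add, {("holding", x)}
    if action == "stack" and len(args) == 2:
        x, y = args
        pre = {("holding", x), ("clear", y)}
        return pre, {("on", x, y), ("clear", x), ("handempty",)}, pre
    if action == "unstack" and len(args) == 2:
        x, y = args
        pre = {("on", x, y), ("clear", x), ("handempty",)}
        return pre, {("holding", x), ("clear", y)}, pre
    raise ValueError(f"unknown or malformed action: {(action, *args)}")


def validate_plan(init, goal, plan):
    """Simulate the plan from init; valid if every step applies and the goal holds at the end."""
    state = set(init)
    for step in plan:
        try:
            pre, add, delete = ground(*step)
        except (ValueError, TypeError):
            return False  # malformed step counts as an invalid plan
        if not pre <= state:
            return False  # a precondition is not satisfied
        state = (state - delete) | add
    return set(goal) <= state


# Block b sits on block a; the goal is to have a on b.
INIT = {("ontable", "a"), ("on", "b", "a"), ("clear", "b"), ("handempty",)}
GOAL = {("on", "a", "b")}
```

Calling `validate_plan(INIT, GOAL, parse_plan(response))` on a model response then gives a yes/no verdict that does not depend on how the plan was worded around the parenthesised steps, which is the point of grounding the evaluation in a formal model.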

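Experimental results

Research questions

RQ1: Can LLMs generate valid plans that achieve explicit goals in familiar commonsense planning domains?
RQ2: Can LLMs produce cost-optimal plans and verify the validity of plans under constraints?
RQ3: Can LLMs reason about plan execution and predict the outcomes of actions?
RQ4: How robust are LLMs to goal reformulation, plan reuse, and replanning under change?
RQ5: How well do learned planning patterns generalize to new instances and obfuscated domains?

Key findings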

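Task	GPT-4 correct instances	GPT-4 accuracy	InstructGPT-3 accuracy
Plan Generation	206/600	34.3%	6.8%
Cost-Optimal Planning	198/600	33%	5.8%
Plan Verification	352/600	58.6%	12%
Reasoning About Plan Execution	191/600	31.8%	0.6%
Replanning	289/600	48.1%	6.6%
Plan Generalization	141/500	28.2%	9.8%
Plan Reuse	392/600	65.3%	17%
Robustness to Goal Reformulation (Shuffling)	461/600	76.8%	77.8%
Robustness to Goal Reformulation (Full→Partial)	522/600	87%	77.8%
Robustness to Goal Reformulation (Partial→Full)	348/600	58%	60.5%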

Plan Generation and Cost-Optimal Planning: GPT-4 scores 34.3% (206/600) and 33.0% (198/600), respectively.
On the same two tasks, InstructGPT-3 scores 6.8% (41/600) and 5.8% (35/600).
Plan Verification: GPT-4 scores 58.6% (352/600); InstructGPT-3 scores 12% (72/600).
Reasoning About Plan Execution: GPT-4 scores 31.8% (191/600), whereas InstructGPT-3 scores only 0.6% (4/600).
Replanning: GPT-4 scores 48.1% (289/600); InstructGPT-3 scores 6.6% (40/600).
Plan Generalization: GPT-4 scores 28.2% (141/500); InstructGPT-3 scores 9.8% (49/500).
Plan Reuse: GPT-4 scores 65.3% (392/600); InstructGPT-3 scores 17% (102/600).
Robustness to Goal Reformulation (Shuffling): GPT-4 scores 76.8% (461/600); InstructGPT-3 scores 77.8% (467/600).
Robustness to Goal Reformulation (Full→Partial): GPT-4 scores 87% (522/600); InstructGPT-3 scores 77.8% (467/600).
Robustness to Goal Reformulation (Partial→Full): GPT-4 scores 58% (348/600); InstructGPT-3 scores 60.5% (363/600).
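
As a quick consistency check, the percentages can be recomputed from the correct-instance counts listed above. The sketch below is illustrative only; the names `COUNTS`, `accuracy`, and `report` are made up for this example. It uses standard rounding, so a few values can differ from the listed figures in the last digit (for example, 352/600 is about 58.67%, listed as 58.6%), which suggests the listed figures may have been truncated rather than rounded.

```python
# (GPT-4 correct, InstructGPT-3 correct, total instances), copied from the findings above.
COUNTS = {
    "Plan Generation": (206, 41, 600),
    "Cost-Optimal Planning": (198, 35, 600),
    "Plan Verification": (352, 72, 600),
    "Reasoning About Plan Execution": (191, 4, 600),
    "Replanning": (289, 40, 600),
    "Plan Generalization": (141, 49, 500),
    "Plan Reuse": (392, 102, 600),
    "Goal Reformulation (Shuffling)": (461, 467, 600),
    "Goal Reformulation (Full→Partial)": (522, 467, 600),
    "Goal Reformulation (Partial→Full)": (348, 363, 600),
}


def accuracy(correct, total):
    """Percentage of instances answered correctly."""
    return 100.0 * correct / total


def report(counts):
    """Return one formatted line per task with both models' accuracies."""
    lines = []
    for task, (gpt4, instruct, total) in counts.items():
        lines.append(
            f"{task}: GPT-4 {accuracy(gpt4, total):.1f}%, "
            f"InstructGPT-3 {accuracy(instruct, total):.1f}%"
        )
    return lines


print("\n".join(report(COUNTS)))
```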

This review was generated by AI and checked by a human editor.