QUICK REVIEW

[Paper Review] On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark)

Karthik Valmeekam, Sarath Sreedharan|arXiv (Cornell University)|Feb 13, 2023

Natural Language Processing Techniques | Cited by 31

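ONE-SENTENCE SUMMARY

This paper proposes a benchmark for systematically evaluating large language models (LLMs) on Blocksworld-style tasks in three modes: autonomous planning, heuristic guidance, and human-in-the-loop. Autonomous planning largely fails (about 3% success rate), while in some modes a planner can repair or exploit the LLM's suggestions.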

ABSTRACT

Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) how good LLMs are by themselves in generating and validating simple plans in commonsense planning tasks (of the type that humans are generally quite good at) and (2) how good LLMs are in being a source of heuristic guidance for other agents--either AI planners or human planners--in their planning tasks. To investigate these questions in a systematic rather than anecdotal manner, we start by developing a benchmark suite based on the kinds of domains employed in the International Planning Competition. On this benchmark, we evaluate LLMs in three modes: autonomous, heuristic and human-in-the-loop. Our results show that LLM's ability to autonomously generate executable plans is quite meager, averaging only about 3% success rate. The heuristic and human-in-the-loop modes show slightly more promise. In addition to these results, we also make our benchmark and evaluation tools available to support investigations by research community.

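MOTIVATION AND GOALS

Evaluate how well LLMs generate and validate executable plans on commonsense planning tasks without external help.
Evaluate whether LLMs can provide useful heuristic guidance to traditional planners.
Evaluate the benefit to human planners of working with LLM-generated plans or suggestions.
Provide an automated, publicly available benchmark and evaluation tools to support reproducible research on LLM planning.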

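PROPOSED METHOD

Develop a benchmark suite, inspired by International Planning Competition domains, that tests plan generation and validation.
Evaluate LLMs in three modes: autonomous, heuristic, and human-in-the-loop.
Use PDDL-style domain models and a template-based natural-language translator to connect symbolic plans with text prompts.
Automate evaluation with an automated planner (LPG) and a plan validator, measuring executability and plan quality.
Base the test cases on Blocksworld and analyze planning performance with standard metrics such as correctness and optimality.
Release the benchmark and tools publicly for research use.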
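To make the validation and translation steps concrete, here is a minimal sketch of a Blocksworld plan check paired with a template-based plan-to-text renderer. It is an illustration under stated assumptions, not the benchmark's actual code: the fact encoding, the `TEMPLATES` wording, and the function names `action_model`, `validate`, and `to_text` are all hypothetical, and the action semantics follow the standard four-operator Blocksworld formulation.

```python
from typing import Iterable, List, Set, Tuple

Fact = Tuple[str, ...]    # e.g. ("on", "a", "b") or ("handempty",)
Action = Tuple[str, ...]  # e.g. ("unstack", "a", "b")

# Hypothetical templates; the benchmark's real wording may differ.
TEMPLATES = {
    "pickup": "pick up the {0} block",
    "putdown": "put down the {0} block",
    "stack": "stack the {0} block on top of the {1} block",
    "unstack": "unstack the {0} block from on top of the {1} block",
}

def action_model(action: Action) -> Tuple[Set[Fact], Set[Fact], Set[Fact]]:
    """Return (preconditions, add effects, delete effects) for one action."""
    name, args = action[0], action[1:]
    if name == "pickup":
        (x,) = args
        pre = {("clear", x), ("ontable", x), ("handempty",)}
        return pre, {("holding", x)}, set(pre)
    if name == "putdown":
        (x,) = args
        add = {("clear", x), ("ontable", x), ("handempty",)}
        return {("holding", x)}, add, {("holding", x)}
    if name == "stack":
        x, y = args
        pre = {("holding", x), ("clear", y)}
        return pre, {("on", x, y), ("clear", x), ("handempty",)}, set(pre)
    if name == "unstack":
        x, y = args
        pre = {("on", x, y), ("clear", x), ("handempty",)}
        return pre, {("holding", x), ("clear", y)}, set(pre)
    raise ValueError(f"unknown action: {name}")

def validate(init: Iterable[Fact], goal: Iterable[Fact], plan: List[Action]) -> bool:
    """True if every step is executable in sequence and the goal holds at the end."""
    state = set(init)
    for action in plan:
        pre, add, delete = action_model(action)
        if not pre <= state:  # a precondition is missing: the plan is not executable
            return False
        state = (state - delete) | add
    return set(goal) <= state

def to_text(plan: List[Action]) -> str:
    """Render a symbolic plan as one natural-language step per line."""
    return "\n".join(TEMPLATES[a[0]].format(*a[1:]) for a in plan)

if __name__ == "__main__":
    init = {("on", "b", "a"), ("ontable", "a"), ("clear", "b"), ("handempty",)}
    goal = {("on", "a", "b")}
    plan = [("unstack", "b", "a"), ("putdown", "b"), ("pickup", "a"), ("stack", "a", "b")]
    print(validate(init, goal, plan))
    print(to_text(plan))
```

A check of this kind separates plans that are merely plausible-looking text from plans that can actually be executed step by step, which is the distinction the executability metric relies on.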

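EXPERIMENTAL RESULTS

Research Questions

RQ1: Can LLMs autonomously generate executable plans in commonsense planning domains?
RQ2: Can LLMs improve planning when used as a source of heuristic guidance for other planners?
RQ3: Do LLM-generated plans help or hinder human planners in solving planning tasks?
RQ4: How do goal reformulation, plan reuse, and replanning affect LLM-assisted planning?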

Main Findings

Task	GPT-3 (correct/total)	Instruct-GPT3 (correct/total)	BLOOM (correct/total)
Plan Generation	6/600 (1%)	41/600 (6.8%)	4/250 (1.6%)
Optimal Planning	2/600 (0.3%)	35/600 (5.8%)	3/150 (2%)
Replanning	47/600 (7.8%)	40/600 (6.6%)	3/100 (3%)
Plan Generalization	33/500 (6.6%)	49/500 (9.8%)	11/100 (11%)
Plan Reuse	0/600 (0%)	102/600 (17%)	0/100 (0%)
Robustness to Goal Reformulation (Shuffling)	460/600 (76.6%)	467/600 (77.8%)	21/100 (21%)
Robustness to Goal Reformulation (Full → Partial)	407/600 (67.8%)	467/600 (77.8%)	9/100 (9%)
Robustness to Goal Reformulation (Partial → Full)	122/600 (20.3%)	363/600 (60.5%)	5/100 (5%)
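As a quick arithmetic check, the sketch below recomputes success rates from the Plan Generation row of the table above and averages them two ways. It assumes the "about 3%" headline figure refers to that row; the source does not say how the average was computed, so both the per-model average and the pooled rate are shown. The variable and function names are illustrative only.

```python
# (correct, total) pairs from the Plan Generation row, in the table's column order.
plan_generation = [(6, 600), (41, 600), (4, 250)]

def success_rate(correct: int, total: int) -> float:
    """Success rate as a percentage."""
    return 100.0 * correct / total

# Per-model rates, then their unweighted mean (each model counts equally).
rates = [success_rate(c, t) for c, t in plan_generation]
macro_average = sum(rates) / len(rates)

# Pooled rate: all correct instances over all attempted instances.
pooled = success_rate(
    sum(c for c, _ in plan_generation),
    sum(t for _, t in plan_generation),
)

print([round(r, 1) for r in rates])
print(round(macro_average, 1), round(pooled, 1))
```

Both averages land between 3% and 4%, so the headline figure is consistent with this row under either reading.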

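LLMs have a very low success rate in autonomous planning: on average, only about 3% of generated plans are executable.
In heuristic mode, plans suggested by the LLM can be repaired into correct plans by an automated planner (LPG) with relatively little effort.
Human-in-the-loop use of LLM suggestions yields modest improvements, but no statistically significant reduction in time or cognitive load.
LLMs do noticeably better in heuristic mode and on some goal-reformulation tasks, but the overall results still point to limited autonomous planning ability.
The human baseline on Blocksworld shows that people produce valid and often optimal plans, outperforming LLMs at autonomous plan generation.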

This review was generated by AI and reviewed by a human editor.