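[Paper Summary] Knowledge Model Prompting Increases LLM Performance on Planning Tasks
TMK-structured prompting improves the planning ability of large language models and large reasoning models on the PlanBench Blocksworld tasks. It usually yields clear accuracy gains, in some cases exceeding state-of-the-art methods, by steering reasoning toward symbolic, code-like execution.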
Large Language Models (LLMs) can struggle with reasoning ability and planning tasks. Many prompting techniques have been developed to assist with LLM reasoning, notably Chain-of-Thought (CoT); however, these techniques, too, have come under scrutiny as LLMs' ability to reason at all has come into question. Borrowing from the domain of cognitive and educational science, this paper investigates whether the Task-Method-Knowledge (TMK) framework can improve LLM reasoning capabilities beyond its previously demonstrated success in educational applications. The TMK framework's unique ability to capture causal, teleological, and hierarchical reasoning structures, combined with its explicit task decomposition mechanisms, makes it particularly well-suited for addressing language model reasoning deficiencies, and unlike other hierarchical frameworks such as HTN and BDI, TMK provides explicit representations of not just what to do and how to do it, but also why actions are taken. The study evaluates TMK by experimenting on the PlanBench benchmark, focusing on the Blocksworld domain to test for reasoning and planning capabilities, examining whether TMK-structured prompting can help language models better decompose complex planning problems into manageable sub-tasks. Results also highlight significant performance inversion in reasoning models. TMK prompting enables the reasoning model to achieve up to an accuracy of 97.3% on opaque, symbolic tasks (Random versions of Blocksworld in PlanBench) where it previously failed (31.5%), suggesting the potential to bridge the gap between semantic approximation and symbolic manipulation. Our findings suggest that TMK functions not merely as context, but also as a mechanism that steers reasoning models away from their default linguistic modes to engage formal, code-execution pathways in the context of the experiments.
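Research Motivation and Goals
- Improve LLM planning and reasoning beyond what standard prompting achieves.
- Use the Task-Method-Knowledge (TMK) framework to structure prompts.
- Evaluate TMK prompting on the PlanBench Blocksworld variants to separate symbolic reasoning from linguistic reasoning.
- Assess whether TMK acts as a cognitive scaffold that shifts reasoning toward symbolic execution.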
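Proposed Method
- Convert the steps of the Blocksworld domain into a TMK (Task, Method, Knowledge) JSON structure.
- Replace the PlanBench domain prompt with the TMK-formatted prompt in a one-shot setting.
- Evaluate TMK prompting on the PlanBench variants (Classic, Mystery, Random) using zero-shot/one-shot comparisons.
- Analyze accuracy and the effect of TMK across model families (LLMs and LRMs).
- Compare TMK results with state-of-the-art PlanBench results and discuss robustness across datasets.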
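To make the first two steps concrete, here is a minimal sketch of what a TMK-style JSON description of Blocksworld and a one-shot prompt built from it might look like. The schema (the `task`, `methods`, and `knowledge` keys and their fields), the `build_prompt` helper, and the example problem text are all assumptions made for illustration; this summary does not give the paper's actual TMK schema or prompt wording.

```python
import json

# Hypothetical TMK description of Blocksworld. All field names are
# illustrative assumptions; the paper's actual schema may differ.
TMK_BLOCKSWORLD = {
    "task": {
        "name": "arrange_blocks",
        "goal": "Reach the goal arrangement of blocks",
        "why": "Each action should bring the current state closer to the goal",
    },
    "methods": [
        {"name": "pickup",
         "requires": ["hand empty", "block clear", "block on table"],
         "produces": ["holding block"]},
        {"name": "putdown",
         "requires": ["holding block"],
         "produces": ["block on table", "hand empty"]},
        {"name": "unstack",
         "requires": ["hand empty", "block clear", "block on other block"],
         "produces": ["holding block", "other block clear"]},
        {"name": "stack",
         "requires": ["holding block", "target clear"],
         "produces": ["block on target", "hand empty"]},
    ],
    "knowledge": {
        "objects": ["block", "table", "hand"],
        "relations": ["on", "on table", "clear", "holding"],
    },
}


def build_prompt(tmk, example, problem):
    """Assemble a one-shot prompt: TMK domain, one solved example, then the query."""
    return "\n\n".join([
        "Domain (TMK):\n" + json.dumps(tmk, indent=2),
        "Example problem:\n" + example["problem"],
        "Example plan:\n" + "\n".join(example["plan"]),
        "Problem:\n" + problem,
        "Plan:",
    ])


# Hypothetical one-shot example and query, written only for this sketch.
example = {
    "problem": "A is on the table. B is on A. Goal: A is on B.",
    "plan": ["unstack B A", "putdown B", "pickup A", "stack A B"],
}
prompt = build_prompt(
    TMK_BLOCKSWORLD, example,
    "A is on the table. B is on the table. Goal: B is on A.",
)
print(prompt)
```

Keeping the domain description as structured data and serializing it at prompt time means the same builder could be reused for the Mystery and Random variants by swapping the action and relation names.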
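Experimental Results
Research Questions
- RQ1: Does TMK-structured prompting improve planning accuracy on PlanBench Blocksworld problems?
- RQ2: Does TMK prompting shift models from relying on linguistic cues toward symbolic, code-like reasoning?
- RQ3: How does TMK perform across the Classic, Mystery, and Random Blocksworld variants and across model types?
- RQ4: What causes the observed performance inversion or domain-specific effects?
- RQ5: Are the gains consistent across flagship models and LRMs, and what are the limitations?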
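RQ1 turns on planning accuracy, that is, the share of problems for which a model's plan is valid. The sketch below shows one way such a check could work for Blocksworld: simulate the plan step by step, reject it if any precondition fails, and then test the goal. It is an illustrative simulator written for this summary, not the validator PlanBench uses; the state encoding (a dict mapping each block to its support) and the function names are assumptions.

```python
def is_clear(on, block):
    """A block is clear when nothing rests on it."""
    return block not in on.values()


def apply_action(on, holding, action):
    """Return the next (on, holding) state, or None if a precondition fails."""
    name, args = action[0], action[1:]
    on = dict(on)  # copy so the caller's state is untouched
    if name == "pickup" and len(args) == 1:
        b = args[0]
        if holding is None and on.get(b) == "table" and is_clear(on, b):
            del on[b]
            return on, b
    elif name == "putdown" and len(args) == 1:
        b = args[0]
        if holding is not None and holding == b:
            on[b] = "table"
            return on, None
    elif name == "unstack" and len(args) == 2:
        b, src = args
        if holding is None and src != "table" and on.get(b) == src and is_clear(on, b):
            del on[b]
            return on, b
    elif name == "stack" and len(args) == 2:
        b, dst = args
        if holding is not None and holding == b and dst in on and is_clear(on, dst):
            on[b] = dst
            return on, None
    return None


def plan_is_valid(initial_on, goal_on, plan):
    """Simulate the plan; it is valid if every step applies and the goal holds."""
    on, holding = dict(initial_on), None
    for action in plan:
        result = apply_action(on, holding, action)
        if result is None:
            return False
        on, holding = result
    return all(on.get(b) == support for b, support in goal_on.items())


def accuracy(outcomes):
    """Percentage of problems whose plan was valid."""
    return 100.0 * sum(outcomes) / len(outcomes)


initial = {"A": "table", "B": "A"}   # B sits on A, A sits on the table
goal = {"A": "B"}                    # goal: A on B
good = [("unstack", "B", "A"), ("putdown", "B"), ("pickup", "A"), ("stack", "A", "B")]
bad = [("pickup", "A"), ("stack", "A", "B")]  # A is not clear, so pickup fails
outcomes = [plan_is_valid(initial, goal, p) for p in (good, bad)]
print(accuracy(outcomes))  # prints 50.0
```

Because the checker only looks at action names and block identifiers, it behaves the same whether those names are meaningful words or opaque strings, which is the property the Mystery and Random variants are designed to probe in the models.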
Key Findings
| Model | Blocksworld variant | Plain-text accuracy (%) | TMK accuracy (%) |
|---|---|---|---|
| GPT-4 | Classic | 34.6 | 39.7 |
| GPT-4 | Mystery | 0 | 3.8 |
| GPT-4 | Random | 0 | 4.17 |
| GPT-4o | Classic | 35.5 | 45.3 |
| GPT-4o | Mystery | 0 | 5.5 |
| GPT-4o | Random | 0.83 | 4.83 |
| o1-mini | Classic | 56.7 | 57 |
| o1-mini | Mystery | 19.1 | 16.83 |
| o1-mini | Random | 9.33 | 27.0 |
| o1-preview | Classic | 97.8 | NA |
| o1-preview | Mystery | 52.8 | NA |
| o1-preview | Random | 37.3 | NA |
| o1 | Classic | 95.7 | 98.5 |
| o1 | Mystery | 74.3 | 83.3 |
| o1 | Random | 31.5 | 97.33 |
| GPT-5 | Classic | 99.3 | 99.7 |
| GPT-5 | Mystery | 98.1 | 98.3 |
| GPT-5 | Random | 92.5 | 99.0 |
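- TMK prompting generally improves the planning accuracy of flagship models across the Blocksworld variants.
- On Random Blocksworld, TMK lifts o1 by up to 65.8 percentage points (from 31.5% to 97.3%).
- A performance inversion appears: TMK moves some models from weak to strong performance on the opaque, symbolic variants. The clearest case is o1, which with plain text scores lower on Random (31.5%) than on Mystery (74.3%), but with TMK scores higher on Random (97.33%) than on Mystery (83.3%).
- GPT-4 and GPT-4o show modest improvements on the Classic, Mystery, and Random variants (for example, GPT-4 on Classic rises from 34.6% to 39.7%).
- The TMK effect is strongest where baseline planning performance is weak; even so, a strong model such as GPT-5 still gains under TMK prompting (for example, from 92.5% to 99.0% on Random).
- For o1-mini the gains are mixed, with a regression on Mystery (from 19.1% to 16.83%), which suggests a capacity limit in shedding semantic distractors.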
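The percentage-point figures quoted in these findings can be recomputed directly from the results table. The short sketch below does that arithmetic; the accuracy values are copied from the table, o1-preview is left out because its TMK column is NA, and the dictionary layout and helper names are just one convenient choice.

```python
# Accuracy (%) copied from the results table: (plain text, TMK).
# o1-preview is omitted because its TMK results are reported as NA.
RESULTS = {
    ("GPT-4", "Classic"): (34.6, 39.7),
    ("GPT-4", "Mystery"): (0.0, 3.8),
    ("GPT-4", "Random"): (0.0, 4.17),
    ("GPT-4o", "Classic"): (35.5, 45.3),
    ("GPT-4o", "Mystery"): (0.0, 5.5),
    ("GPT-4o", "Random"): (0.83, 4.83),
    ("o1-mini", "Classic"): (56.7, 57.0),
    ("o1-mini", "Mystery"): (19.1, 16.83),
    ("o1-mini", "Random"): (9.33, 27.0),
    ("o1", "Classic"): (95.7, 98.5),
    ("o1", "Mystery"): (74.3, 83.3),
    ("o1", "Random"): (31.5, 97.33),
    ("GPT-5", "Classic"): (99.3, 99.7),
    ("GPT-5", "Mystery"): (98.1, 98.3),
    ("GPT-5", "Random"): (92.5, 99.0),
}

# Gain in percentage points = TMK accuracy minus plain-text accuracy.
gains = {key: round(tmk - plain, 2) for key, (plain, tmk) in RESULTS.items()}

largest = max(gains, key=gains.get)
regressions = [key for key, delta in gains.items() if delta < 0]

for (model, variant), delta in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{model:8s} {variant:8s} {delta:+.2f} pp")
print("largest gain:", largest)
print("regressions:", regressions)
```

With the table's 97.33% for o1 on Random, the gain comes out at 65.83 points, which matches the 65.8 quoted above once the TMK score is rounded to 97.3%. The only negative entry is o1-mini on Mystery.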
This summary was generated by AI and reviewed by a human editor.