QUICK REVIEW

[Paper Review] Knowledge Model Prompting Increases LLM Performance on Planning Tasks

Erik W. W. Goh, John Kos|arXiv (Cornell University)|Feb 3, 2026

Multimodal Machine Learning Applications | Citations: 0

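One-Line Summary

TMK-structured prompting improves the planning ability of LLMs and LRMs on PlanBench Blocksworld tasks. By steering reasoning toward symbolic, code-like execution, it can yield large accuracy gains, in some cases exceeding the state of the art.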

ABSTRACT

Large Language Models (LLM) can struggle with reasoning ability and planning tasks. Many prompting techniques have been developed to assist with LLM reasoning, notably Chain-of-Thought (CoT); however, these techniques, too, have come under scrutiny as LLMs' ability to reason at all has come into question. Borrowing from the domain of cognitive and educational science, this paper investigates whether the Task-Method-Knowledge (TMK) framework can improve LLM reasoning capabilities beyond its previously demonstrated success in educational applications. The TMK framework's unique ability to capture causal, teleological, and hierarchical reasoning structures, combined with its explicit task decomposition mechanisms, makes it particularly well-suited for addressing language model reasoning deficiencies, and unlike other hierarchical frameworks such as HTN and BDI, TMK provides explicit representations of not just what to do and how to do it, but also why actions are taken. The study evaluates TMK by experimenting on the PlanBench benchmark, focusing on the Blocksworld domain to test for reasoning and planning capabilities, examining whether TMK-structured prompting can help language models better decompose complex planning problems into manageable sub-tasks. Results also highlight significant performance inversion in reasoning models. TMK prompting enables the reasoning model to achieve up to an accuracy of 97.3% on opaque, symbolic tasks (Random versions of Blocksworld in PlanBench) where it previously failed (31.5%), suggesting the potential to bridge the gap between semantic approximation and symbolic manipulation. Our findings suggest that TMK functions not merely as context, but also as a mechanism that steers reasoning models away from their default linguistic modes to engage formal, code-execution pathways in the context of the experiments.

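Motivation and Objectives

Motivate improving LLM planning and reasoning beyond standard prompting.
Structure prompts using the Task-Method-Knowledge (TMK) framework.
Evaluate TMK prompting on the PlanBench Blocksworld variants, comparing symbolic and linguistic reasoning.
Assess whether TMK acts as a cognitive scaffold that shifts reasoning toward symbolic execution.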

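Proposed Method

Convert the Blocksworld domain procedures into a TMK (Task, Method, Knowledge) JSON structure.
Replace the PlanBench domain prompt with a TMK-format prompt in a one-shot setting.
Evaluate TMK prompts on the Classic, Mystery, and Random PlanBench variants, with zero-shot/one-shot comparisons.
Analyze the accuracy of TMK prompts and their effect across model families (LLMs and LRMs).
Compare TMK performance against state-of-the-art PlanBench results and examine robustness across datasets.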
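The first two steps above can be sketched in code. The following is a minimal illustration only: the review does not reproduce the paper's TMK schema or prompt wording, so every field name (`task`, `methods`, `knowledge`, `why`, and so on), the section labels, and the helper `build_tmk_prompt` are hypothetical. Only the four standard Blocksworld actions are taken from the domain itself.

```python
import json

# Hypothetical TMK description of Blocksworld. The field names are
# illustrative assumptions, not the schema used in the paper.
tmk_model = {
    "task": {
        "name": "arrange-blocks",
        "goal": "Reach the goal configuration of blocks",
        # TMK also records *why* actions are taken (teleology).
        "why": "Each action should bring the state closer to the goal",
    },
    "methods": [
        {
            "name": "pick-up",
            "preconditions": ["block is clear", "block is on the table", "hand is empty"],
            "effects": ["hand holds block"],
        },
        {
            "name": "put-down",
            "preconditions": ["hand holds block"],
            "effects": ["block is on the table", "hand is empty"],
        },
        {
            "name": "stack",
            "preconditions": ["hand holds block", "target block is clear"],
            "effects": ["block is on target block", "hand is empty"],
        },
        {
            "name": "unstack",
            "preconditions": ["block is clear", "block is on another block", "hand is empty"],
            "effects": ["hand holds block", "lower block is clear"],
        },
    ],
    "knowledge": {
        "objects": ["block", "table", "hand"],
        "relations": ["on", "clear", "holding"],
    },
}


def build_tmk_prompt(model: dict, example: str, problem: str) -> str:
    """Assemble a one-shot prompt: TMK JSON, one solved example, then the query."""
    sections = [
        "Domain description (TMK, JSON):",
        json.dumps(model, indent=2),
        "Solved example:",
        example,
        "Problem to solve:",
        problem,
    ]
    return "\n\n".join(sections)
```

Calling `build_tmk_prompt(tmk_model, solved_example, new_problem)` returns a single string in which the TMK JSON takes the place of the plain-text domain description, while the solved example and the query are passed through unchanged.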

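Experimental Results

Research Questions

RQ1: Does TMK-structured prompting improve planning accuracy on PlanBench Blocksworld problems?
RQ2: Does TMK prompting shift models from linguistic cues toward symbolic, code-like reasoning?
RQ3: How does TMK perform across the Classic, Mystery, and Random Blocksworld variants and across model types?
RQ4: What causes the observed performance inversion and domain-specific effects?
RQ5: Are the benefits consistent across both flagship models and LRMs, and what are the limitations?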

Key Findings

Model	Variant	Plain Text (%)	TMK (%)
GPT-4	Classic	34.6	39.7
GPT-4	Mystery	0	3.8
GPT-4	Random	0	4.17
GPT-4o	Classic	35.5	45.3
GPT-4o	Mystery	0	5.5
GPT-4o	Random	0.83	4.83
o1-mini	Classic	56.7	57
o1-mini	Mystery	19.1	16.83
o1-mini	Random	9.33	27.0
o1-preview	Classic	97.8	NA
o1-preview	Mystery	52.8	NA
o1-preview	Random	37.3	NA
o1	Classic	95.7	98.5
o1	Mystery	74.3	83.3
o1	Random	31.5	97.33
GPT-5	Classic	99.3	99.7
GPT-5	Mystery	98.1	98.3
GPT-5	Random	92.5	99.0

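TMK prompting generally improves the planning accuracy of flagship models across the Blocksworld variants.
On Random Blocksworld, the o1 model gains up to 65.8 percentage points in accuracy (from 31.5% to 97.3%).
TMK produces a performance inversion in which some models shift from the semantic/Mystery regime to strong symbolic performance. It is most pronounced for o1: with TMK it scores higher on Random (97.33%) than on Mystery (83.3%), reversing the plain-text ordering (31.5% vs. 74.3%).
GPT-4 and GPT-4o show modest improvements on each of the Classic, Mystery, and Random variants (for example, GPT-4 rises from 34.6% to 39.7% on Classic).
TMK's effect is strongest for models with weak baseline planning performance, but even strong models such as GPT-5 show non-trivial gains under TMK prompting.
For o1-mini, the benefits of TMK are mixed and include a degradation on the Mystery domain, suggesting a capacity limit in resolving semantic interference.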
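The gains quoted in these findings can be recomputed directly from the table. The short sketch below copies the plain-text and TMK accuracies by hand and takes their difference in percentage points; the dictionary layout and the names `results`, `gain`, `best`, and `worst` are illustrative choices, and o1-preview is left out because no TMK result is reported for it.

```python
# (plain-text %, TMK %) pairs copied from the results table.
# o1-preview is omitted because its TMK column is NA.
results = {
    ("GPT-4", "Classic"): (34.6, 39.7),
    ("GPT-4", "Mystery"): (0.0, 3.8),
    ("GPT-4", "Random"): (0.0, 4.17),
    ("GPT-4o", "Classic"): (35.5, 45.3),
    ("GPT-4o", "Mystery"): (0.0, 5.5),
    ("GPT-4o", "Random"): (0.83, 4.83),
    ("o1-mini", "Classic"): (56.7, 57.0),
    ("o1-mini", "Mystery"): (19.1, 16.83),
    ("o1-mini", "Random"): (9.33, 27.0),
    ("o1", "Classic"): (95.7, 98.5),
    ("o1", "Mystery"): (74.3, 83.3),
    ("o1", "Random"): (31.5, 97.33),
    ("GPT-5", "Classic"): (99.3, 99.7),
    ("GPT-5", "Mystery"): (98.1, 98.3),
    ("GPT-5", "Random"): (92.5, 99.0),
}


def gain(plain: float, tmk: float) -> float:
    """Change in accuracy, in percentage points, when switching to TMK."""
    return round(tmk - plain, 2)


gains = {key: gain(*pair) for key, pair in results.items()}

best = max(gains, key=gains.get)   # largest improvement
worst = min(gains, key=gains.get)  # largest degradation

print(best, gains[best])    # ('o1', 'Random') 65.83
print(worst, gains[worst])  # ('o1-mini', 'Mystery') -2.27
```

The table's 97.33% gives a gain of 65.83 points for o1 on Random, which matches the rounded 65.8 quoted in the findings. The only negative entry is o1-mini on Mystery.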


This review was generated by AI and checked by a human editor.