[Paper Review] PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents
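PEARL prompts GPT-4 to mine actions from data, formulate a plan, and execute that plan step by step to reason over long documents, outperforming zero-shot and chain-of-thought baselines on a long-context subset of QuALITY.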
Strategies such as chain-of-thought prompting improve the performance of large language models (LLMs) on complex reasoning tasks by decomposing input examples into intermediate steps. However, it remains unclear how to apply such methods to reason over long input documents, in which both the decomposition and the output of each intermediate step are non-trivial to obtain. In this work, we propose PEARL, a prompting framework to improve reasoning over long documents, which consists of three stages: action mining, plan formulation, and plan execution. More specifically, given a question about a long document, PEARL decomposes the question into a sequence of actions (e.g., SUMMARIZE, FIND_EVENT, FIND_RELATION) and then executes them over the document to obtain the answer. Each stage of PEARL is implemented via zero-shot or few-shot prompting of LLMs (in our work, GPT-4) with minimal human input. We evaluate PEARL on a challenging subset of the QuALITY dataset, which contains questions that require complex reasoning over long narrative texts. PEARL outperforms zero-shot and chain-of-thought prompting on this dataset, and ablation experiments show that each stage of PEARL is critical to its performance. Overall, PEARL is a first step towards leveraging LLMs to reason over long documents.
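Motivation and Objectives
- Motivate and address complex reasoning over long documents, where decomposing the task is non-trivial.
- Introduce a three-stage prompting framework (action mining, plan generation, plan execution) for constructing and executing plans over long inputs.
- Demonstrate effectiveness on a challenging subset of QuALITY, a long-form narrative QA dataset.
- Present ablations that test whether each PEARL stage is necessary, and analyze sources of error.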
Proposed Method
- Action mining: an LLM generates a dataset-specific set of basic reasoning actions from seed demonstrations.
- Plan generation: given a question, an LLM creates an executable plan selecting actions from the mined set, formatted as a simple program.
- Plan execution: the LLM executes the plan action-by-action over the long document using a structured prompt template.
- Self-correction and self-refinement: plan syntax is corrected and plan demonstrations are refined based on task-specific evaluation.
- Baselines: PEARL is compared against zero-shot GPT-4, zero-shot GPT-3.5, and zero-shot chain-of-thought prompting with GPT-4; ablations remove plan execution or self-refinement to assess their impact.
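
The plan-execution stage described above can be sketched as a small interpreter loop. This is a minimal illustration, not the authors' implementation: the plan syntax (`out_var = ACTION(arg1, arg2)`), the `CTX` name for the document, the prompt layout, and the `llm` callable are all assumptions made for this example.

```python
import re
from typing import Callable, Dict, List

# Hypothetical LLM interface: takes a prompt string, returns the completion.
LLM = Callable[[str], str]

# Assumed plan syntax, one step per line: out_var = ACTION(arg1, arg2)
STEP_RE = re.compile(r"^\s*(\w+)\s*=\s*([A-Z_]+)\((.*)\)\s*$")


def parse_plan(plan_text: str) -> List[dict]:
    """Parse plan text into a list of {var, action, args} steps."""
    steps = []
    for line in plan_text.strip().splitlines():
        match = STEP_RE.match(line)
        if match is None:
            raise ValueError(f"Malformed plan step: {line!r}")
        var, action, raw_args = match.groups()
        # Simplification: arguments are split on commas, so quoted
        # arguments must not contain commas themselves.
        args = [a.strip() for a in raw_args.split(",") if a.strip()]
        steps.append({"var": var, "action": action, "args": args})
    return steps


def execute_plan(steps: List[dict], document: str,
                 actions: Dict[str, str], llm: LLM) -> str:
    """Run each step in order, feeding earlier outputs into later steps."""
    if not steps:
        raise ValueError("Plan has no steps")
    memory = {"CTX": document}  # CTX is the assumed name for the long document
    for step in steps:
        if step["action"] not in actions:
            raise KeyError(f"Unknown action: {step['action']}")
        # Replace variable names with stored values; keep literals as-is.
        inputs = [memory.get(arg, arg) for arg in step["args"]]
        prompt = (
            f"Action: {step['action']}\n"
            f"Definition: {actions[step['action']]}\n"
            "Inputs:\n" + "\n".join(inputs)
        )
        memory[step["var"]] = llm(prompt)
    # The output of the last step is treated as the final answer.
    return memory[steps[-1]["var"]]
```

In use, `actions` would hold the mined action names and their definitions, `plan_text` would come from the plan-generation prompt, and `llm` would wrap a real model call. Because each step's output is stored under its variable name, a later step such as `answer = SUMMARIZE(event)` receives the text produced by an earlier `event = FIND_EVENT(CTX, ...)` step rather than the raw document.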

Experimental Results
Research Questions
- RQ1: Can LLMs reason effectively over long documents by decomposing tasks into executable actions and plans?
- RQ2: Does a learned, data-driven action set plus plan execution improve long-context reasoning compared to direct prompting?
- RQ3: How critical is each PEARL stage (action mining, plan generation, plan execution, self-correction/refinement) to final performance?
- RQ4: What are the main error modes, and how does plan execution influence answer quality over long inputs?
Key Findings
| Method | Long | Short | All | p-val |
|---|---|---|---|---|
| GPT-4 zero-shot | 64.3 | 64.3 | 68.8 | - |
| GPT-3.5 zero-shot | 45.5 | 56.3 | 48.8 | 0.000 |
| GPT-4 zero-shot chain-of-thought | 65.9 | 77.2 | 69.3 | 0.766 |
| GPT-4 PEARL | 70.9 | 77.8 | 73.0 | 0.005 |
| Ablation: w/o plan execution | 67.3 | 77.2 | 70.3 | 0.295 |
| Ablation: w/o self-refinement of plan demonstrations | 67.0 | 78.8 | 70.6 | 0.245 |
- PEARL outperforms zero-shot GPT-4 and zero-shot chain-of-thought prompting on long-context questions from QuALITY (70.9 vs. 64.3 and 65.9).
- Increasing the action set size helps up to an optimum; too many actions degrade performance because execution becomes harder.
- Plan execution over the long document is necessary for the observed gains: removing it lowers accuracy from 73.0 to 70.3 overall (70.9 to 67.3 on long-context questions), although the ablated variant still scores above zero-shot GPT-4 (68.8 overall).
- Self-refinement of plan demonstrations also matters: removing it lowers overall accuracy from 73.0 to 70.6.
- Human evaluation finds most generated plans reasonable, with some unnecessary steps and occasional omissions.
- PEARL substantially improves accuracy on questions involving causes, person-centric questions, and not/except questions; differences are statistically significant in several cases.
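
The gaps behind these findings can be recomputed directly from the results table. The snippet below is only a convenience sketch: the numbers are copied from the Long and All columns of the table, while the dictionary layout and the `gap` helper are introduced here for illustration.

```python
# Accuracy (%) copied from the results table, as (Long, All).
results = {
    "GPT-4 zero-shot": (64.3, 68.8),
    "GPT-4 zero-shot chain-of-thought": (65.9, 69.3),
    "GPT-4 PEARL": (70.9, 73.0),
    "w/o plan execution": (67.3, 70.3),
    "w/o self-refinement": (67.0, 70.6),
}


def gap(a: str, b: str) -> tuple:
    """Return the (Long, All) accuracy difference a - b, rounded to 0.1."""
    return tuple(round(x - y, 1) for x, y in zip(results[a], results[b]))


# PEARL vs. the zero-shot GPT-4 baseline.
print(gap("GPT-4 PEARL", "GPT-4 zero-shot"))      # (6.6, 4.2)
# Cost of removing plan execution from PEARL.
print(gap("GPT-4 PEARL", "w/o plan execution"))   # (3.6, 2.7)
# Cost of removing self-refinement of plan demonstrations.
print(gap("GPT-4 PEARL", "w/o self-refinement"))  # (3.9, 2.4)
```

The "~3 points" lost without plan execution corresponds to 3.6 points on long-context questions and 2.7 points overall.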

This review was generated by AI and checked by a human editor.