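[Paper Explained] PAL: Program-aided Language Models
PaL uses a large language model to generate interleaved natural language and Python code as intermediate reasoning steps, then hands the solving step to a Python interpreter to produce the final answer. It achieves state-of-the-art few-shot accuracy across 13 mathematical, symbolic, and algorithmic tasks, often outperforming much larger models that use chain-of-thought prompting.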
Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks, when provided with a few examples at test time ("few-shot prompting"). Much of this success can be attributed to prompting methods such as "chain-of-thought", which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solving each step of the problem. While LLMs seem to be adept at this sort of step-by-step decomposition, LLMs often make logical and arithmetic mistakes in the solution part, even when the problem is decomposed correctly. In this paper, we present Program-Aided Language models (PAL): a novel approach that uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter. We demonstrate this synergy between a neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and algorithmic reasoning tasks from BIG-Bench Hard and other benchmarks. In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models. For example, PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1. Our code and data are publicly available at http://reasonwithpal.com/.
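Research Motivation and Goals
- Elicit robust reasoning from LLMs by separating problem decomposition from solution execution.
- Show that generating executable code as the reasoning trace improves accuracy over chain-of-thought across diverse benchmarks.
- Show that PaL achieves state-of-the-art few-shot performance on gsm8k and competitive gains on symbolic and algorithmic tasks.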
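Proposed Method
- The LLM generates interleaved natural language and programming language steps as intermediate reasoning (t = NL ∪ PL).
- The final answer is produced by executing the generated code in a Python interpreter; the prompt is designed so that the LLM itself does not output the final answer.
- Prompts are written with meaningful variable names and optional NL comments; execution results can be fed back to the LLM if needed (the experiments execute the code post hoc).
- The experiments compare Direct prompting, chain-of-thought prompting, and PaL on mathematical, symbolic, and algorithmic tasks, with Codex as the base LLM.
- Evaluation uses a fixed-prompt few-shot setting; for gsm8k, majority voting over sampled outputs (k>1) is also used to improve accuracy.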
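The steps above can be sketched as a small pipeline: take the programs the LLM produced, execute each one in a Python interpreter, and take a majority vote over the resulting answers. This is a minimal sketch and not the authors' implementation. The word problem, the hard-coded `SAMPLED_PROGRAMS` strings (standing in for sampled LLM outputs), the `solution()` naming convention, and the helper names `run_program` and `majority_vote` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical model outputs. In a real setup these strings would be sampled
# from the LLM; they are hard-coded here so the sketch runs offline.
# Illustrative problem: "Roger has 5 tennis balls. He buys 2 cans of
# 3 tennis balls each. How many tennis balls does he have now?"
SAMPLED_PROGRAMS = [
    '''
def solution():
    # Roger starts with 5 tennis balls.
    tennis_balls = 5
    # He buys 2 cans with 3 balls each.
    bought_balls = 2 * 3
    return tennis_balls + bought_balls
''',
    '''
def solution():
    initial_balls = 5
    cans = 2
    balls_per_can = 3
    return initial_balls + cans * balls_per_can
''',
    '''
def solution():
    # A faulty sample: adds the operands instead of multiplying.
    tennis_balls = 5
    bought_balls = 2 + 3
    return tennis_balls + bought_balls
''',
]


def run_program(code: str):
    """Execute generated code in a fresh namespace and call its solution()."""
    namespace = {}
    exec(code, namespace)  # the interpreter, not the LLM, computes the answer
    return namespace["solution"]()


def majority_vote(programs):
    """Run every sampled program and return the most common answer."""
    answers = []
    for code in programs:
        try:
            answers.append(run_program(code))
        except Exception:
            # Programs that fail to run simply do not get a vote.
            continue
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    print(majority_vote(SAMPLED_PROGRAMS))
```

Calling `exec` on model-generated code is unsafe outside a sandbox; a real deployment would isolate execution (for example in a separate process with a timeout), which this sketch omits for brevity.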
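Experimental Results
Research Questions
- RQ1: Can program-generated reasoning steps combined with an external interpreter outperform chain-of-thought prompting on mathematical, symbolic, and algorithmic tasks?
- RQ2: How does PaL perform on large-number arithmetic, and how robust is it to prompt variations (variable names, NL comments) across diverse benchmarks?
- RQ3: Do PaL's benefits come from the code-generation prompt, the interpreter, or the combination of the two?
- RQ4: How does PaL compare with large LLMs that use chain-of-thought on standard benchmarks such as gsm8k?
Main Findings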
| Task | Direct (Codex) | CoT (UL2-20B) | CoT (LaMDA-137B) | CoT (Codex) | CoT (PaLM-540B) | CoT (Minerva 540B) | PaL (Codex) |
|---|---|---|---|---|---|---|---|
| gsm8k | 19.7 | 4.1 | 17.1 | 65.6 | 56.9 | 58.8 | 72.0 |
| gsm-hard | 5.0 | - | - | 23.1 | - | - | 61.2 |
| svamp | 69.9 | 12.6 | 39.9 | 74.8 | 79.0 | - | 79.4 |
| asdiv | 74.0 | 16.9 | 49.0 | 76.9 | 73.9 | - | 79.6 |
| singleeq | 86.8 | - | - | 89.1 | 92.3 | - | 96.1 |
| singleop | 93.1 | - | - | 91.9 | 94.1 | - | 94.6 |
| addsub | 90.9 | - | - | 86.0 | 91.9 | - | 92.5 |
| multiarith | 44.0 | 10.7 | 51.8 | 95.9 | 94.7 | - | 99.2 |
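The margins implied by the table can be recomputed directly from its values. The snippet below is a small arithmetic check, not part of the paper: the dictionaries copy the CoT (Codex), CoT (PaLM-540B), and PaL (Codex) columns from the table above, and the variable names are chosen here for illustration.

```python
# Accuracy values copied from the table above (percent).
COT_CODEX = {
    "gsm8k": 65.6, "gsm-hard": 23.1, "svamp": 74.8, "asdiv": 76.9,
    "singleeq": 89.1, "singleop": 91.9, "addsub": 86.0, "multiarith": 95.9,
}
PAL_CODEX = {
    "gsm8k": 72.0, "gsm-hard": 61.2, "svamp": 79.4, "asdiv": 79.6,
    "singleeq": 96.1, "singleop": 94.6, "addsub": 92.5, "multiarith": 99.2,
}
COT_PALM_540B_GSM8K = 56.9

# Absolute gain of PaL over chain-of-thought with the same base model (Codex).
gains = {task: round(PAL_CODEX[task] - COT_CODEX[task], 1) for task in PAL_CODEX}

# Absolute gain over CoT with PaLM-540B on gsm8k (the "15%" in the abstract).
gain_over_palm = round(PAL_CODEX["gsm8k"] - COT_PALM_540B_GSM8K, 1)

if __name__ == "__main__":
    for task, gain in gains.items():
        print(f"{task:<11} +{gain}")
    print(f"gsm8k vs. CoT (PaLM-540B): +{gain_over_palm}")
```

With the same Codex base model, the largest margin is on gsm-hard (38.1 points), while the margins on the other arithmetic benchmarks range from 2.7 to 7.0 points.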
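- PaL with Codex achieves state-of-the-art few-shot accuracy on gsm8k, clearly ahead of larger models that use chain-of-thought.
- On gsm-hard (large numbers), PaL remains robust, whereas Direct prompting and CoT degrade sharply.
- On the symbolic reasoning and algorithmic tasks from BIG-Bench Hard, PaL significantly outperforms CoT and narrows the gap to solving these tasks with high accuracy (e.g., Colored Objects, Penguins, Date, Object Counting, Repeat Copy).
- PaL's improvements also hold for weaker base LMs and for LMs trained mainly on natural language, provided they have sufficient code-modeling ability.
- Ablations show that meaningful variable names and NL comments contribute to PaL's effectiveness; removing them lowers performance.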
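One way to see why delegating arithmetic to the interpreter helps on large-number problems is that the same program stays exact however large the operands become, because Python integers have arbitrary precision. The sketch below is illustrative only: the `solution` function and the operand values are assumptions chosen here, not items from gsm-hard.

```python
def solution(initial_balls: int, cans: int, balls_per_can: int) -> int:
    """Same program structure as a small word problem, with operands as inputs."""
    bought_balls = cans * balls_per_can
    return initial_balls + bought_balls


# The program is unchanged; only the operands grow.
small = solution(5, 2, 3)
large = solution(5, 7_854_213, 3)
huge = solution(5, 10**30, 3)

# For contrast, fixed-precision floating point loses the low-order digits.
float_huge = 5.0 + float(10**30) * 3.0

if __name__ == "__main__":
    print(small)
    print(large)
    print(huge)
    print(int(float_huge) == huge)
```

Once the decomposition into steps is right, the size of the numbers no longer affects correctness, which is consistent with the gsm-hard row of the table.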
This explainer was generated by AI and reviewed by a human editor.