QUICK REVIEW

[Paper Review] PAL: Program-aided Language Models

Luyu Gao, Aman Madaan|arXiv (Cornell University)|Nov 18, 2022

Topic Modeling | Cited by 104

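One-line summary

PaL uses a large language model to generate interleaved natural language and Python code as intermediate reasoning steps, then offloads the solution step to a Python interpreter to obtain the final answer. Across 13 mathematical, symbolic, and algorithmic tasks, it achieves state-of-the-art few-shot accuracy, in most cases exceeding chain-of-thought prompting.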

ABSTRACT

Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks, when provided with a few examples at test time ("few-shot prompting"). Much of this success can be attributed to prompting methods such as "chain-of-thought", which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solving each step of the problem. While LLMs seem to be adept at this sort of step-by-step decomposition, LLMs often make logical and arithmetic mistakes in the solution part, even when the problem is decomposed correctly. In this paper, we present Program-Aided Language models (PAL): a novel approach that uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter. We demonstrate this synergy between a neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and algorithmic reasoning tasks from BIG-Bench Hard and other benchmarks. In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models. For example, PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1. Our code and data are publicly available at http://reasonwithpal.com/ .

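Motivation and objectives

Separate problem decomposition from solution execution to promote robust reasoning in LLMs.
Demonstrate that generating executable code as the reasoning improves accuracy over chain-of-thought across diverse benchmarks.
Show that PaL achieves state-of-the-art few-shot performance on gsm8k and competitive improvements on symbolic and algorithmic tasks.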

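Proposed method

The LLM generates its intermediate reasoning as interleaved natural language (NL) and programming language (PL) steps (t = NL ∪ PL).
The final answer is obtained by running the generated code in a Python interpreter; the LLM is not asked to state the final answer itself.
Prompts are written with meaningful variable names and optional NL comments; execution results can be fed back to the model when needed (the experiments use post-hoc execution).
The experiments compare Direct prompting, chain-of-thought prompting, and PaL on Math, Symbolic, and Algorithmic tasks, using Codex as the base LLM.
Evaluation includes a few-shot setting with fixed prompts; on gsm8k, majority voting over multiple samples (k>1) further improves accuracy.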
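
The generate-then-execute pipeline and the majority-voting step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the program strings are hand-written stand-ins for sampled LLM output, and `run_program`, `majority_vote`, and the convention that each program stores its result in a variable named `answer` are assumptions made for this example.

```python
from collections import Counter

# Hand-written stand-ins for programs an LLM might sample for the word problem:
# "Olivia has 23 dollars. She buys 5 bagels at 3 dollars each. How much is left?"
SAMPLED_PROGRAMS = [
    # Sample 1: correct, NL comments interleaved with code
    """
# Olivia starts with 23 dollars
money_initial = 23
# She buys 5 bagels at 3 dollars each
num_bagels = 5
bagel_cost = 3
money_spent = num_bagels * bagel_cost
# The answer is what remains
answer = money_initial - money_spent
""",
    # Sample 2: correct, more compact
    """
money_initial = 23
money_spent = 5 * 3
answer = money_initial - money_spent
""",
    # Sample 3: flawed decomposition (forgets to subtract)
    """
num_bagels = 5
bagel_cost = 3
answer = num_bagels * bagel_cost
""",
]


def run_program(source):
    """Execute generated code and return its `answer` variable (assumed convention).

    Returns None if the program raises or never sets `answer`.
    Note: exec on untrusted model output is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
    except Exception:
        return None
    return namespace.get("answer")


def majority_vote(answers):
    """Return the most common non-None answer, or None if there is none."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# The interpreter, not the LLM, produces each candidate answer.
answers = [run_program(program) for program in SAMPLED_PROGRAMS]
final_answer = majority_vote(answers)
print(answers, final_answer)
```

With k=1 the pipeline reduces to a single `run_program` call; voting only matters when several programs are sampled and some of them decompose the problem incorrectly.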

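Experimental results

Research questions

RQ1: Can combining program-generating reasoning steps with an external interpreter outperform chain-of-thought prompting across Math, Symbolic, and Algorithmic tasks?
RQ2: How robust is PaL to arithmetic with large numbers and to prompt variation (variable names, NL comments) across benchmarks?
RQ3: Are PaL's benefits due to the code-generation prompt, the interpreter, or the combination of the two?
RQ4: How does PaL compare with larger LLMs that use chain-of-thought on standard benchmarks such as gsm8k?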

Main findings

Task	Direct (Codex)	CoT (UL2-20B)	CoT (LaMDA-137B)	CoT (Codex)	CoT (PaLM-540B)	CoT (Minerva 540B)	PaL (Codex)
gsm8k	19.7	4.1	17.1	65.6	56.9	58.8	72.0
gsm-hard	5.0	-	-	23.1	-	-	61.2
svamp	69.9	12.6	39.9	74.8	79.0	-	79.4
asdiv	74.0	16.9	49.0	76.9	73.9	-	79.6
singleeq	86.8	-	-	89.1	92.3	-	96.1
singleop	93.1	-	-	91.9	94.1	-	94.6
addsub	90.9	-	-	86.0	91.9	-	92.5
multiarith	44.0	10.7	51.8	95.9	94.7	-	99.2
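
The margins implied by the table can be checked with a few lines of arithmetic. The accuracies below are copied from the table; the dictionary and variable names are illustrative only.

```python
# Accuracy (%) copied from the table: task -> (CoT with Codex, PaL with Codex)
results = {
    "gsm8k": (65.6, 72.0),
    "gsm-hard": (23.1, 61.2),
    "svamp": (74.8, 79.4),
    "asdiv": (76.9, 79.6),
    "singleeq": (89.1, 96.1),
    "singleop": (91.9, 94.6),
    "addsub": (86.0, 92.5),
    "multiarith": (95.9, 99.2),
}

# Absolute improvement of PaL over CoT with the same base model (Codex)
margins = {task: round(pal - cot, 1) for task, (cot, pal) in results.items()}
for task, margin in margins.items():
    print(f"{task:<10} +{margin:.1f}")

# Gap on gsm8k between PaL (Codex) and CoT (PaLM-540B), from the gsm8k row
palm_gap = round(72.0 - 56.9, 1)
print(f"gsm8k vs. PaLM-540B CoT: +{palm_gap:.1f}")
```

The gsm8k gap to PaLM-540B comes out at 15.1 points, consistent with the "absolute 15%" stated in the abstract, and the largest same-model margin is on gsm-hard.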

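PaL with Codex achieves state-of-the-art few-shot accuracy on gsm8k (72.0), clearly outperforming larger models that use chain-of-thought (56.9 for PaLM-540B, 58.8 for Minerva 540B).
On gsm-hard (large numbers), PaL remains robust (61.2), while Direct prompting and CoT degrade sharply (5.0 and 23.1 with Codex).
Across the symbolic reasoning and algorithmic tasks from BIG-Bench Hard, PaL substantially outperforms CoT, solving tasks such as Colored Objects, Penguins, Date, Object Counting, and Repeat Copy with high accuracy.
PaL's gains persist even when the base LM is weaker or trained mainly on natural language, as long as its code modeling ability is sufficient.
Ablations show that meaningful variable names and NL comments contribute to PaL's effectiveness; removing them lowers performance.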
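
One plausible contributor to the gsm-hard result on the interpreter side is that Python integers have arbitrary precision, so the same program stays exact when its operands become very large. A minimal sketch under that assumption; the function name and the numbers are made up for illustration and do not come from the benchmark.

```python
def remaining_money(money_initial, num_items, item_cost):
    """PaL-style solution: descriptive names, arithmetic left to the interpreter."""
    money_spent = num_items * item_cost
    return money_initial - money_spent


# Small operands, as in a typical gsm8k-style problem
small = remaining_money(23, 5, 3)

# Same program structure with much larger operands, in the spirit of gsm-hard
large = remaining_money(23_000_000_000_000_000_007, 5_000_000_003, 3_000_000_001)

print(small)
print(large)
```

The decomposition the LLM has to produce is identical in both calls; only the literals change, which is why the execution step itself does not get harder as the numbers grow.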

This review was generated by AI and checked by a human editor.