QUICK REVIEW

[Paper Review] Program Synthesis with Large Language Models

Jacob Austin|arXiv (Cornell University)|Aug 16, 2021

Software Engineering Research | References: 94 | Citations: 28

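One-Line Summary

This paper evaluates large Transformer language models with up to 137B parameters on Python code synthesis using two benchmarks (MBPP and MathQA-Python). It shows that performance improves with model size in both the few-shot and fine-tuning regimes, explores dialog-based human feedback, and analyzes the limits of semantic grounding and execution prediction.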

ABSTRACT

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.

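Motivation and Objectives

Investigate how well general-purpose large language models can synthesize short Python programs from natural language descriptions.
Create two Python code synthesis datasets, MBPP and MathQA-Python, and evaluate synthesis across different linguistic and programmatic challenges.
Evaluate performance across model sizes (244M to 137B parameters) in the few-shot and fine-tuning settings.
Examine how dialog and human feedback affect the quality of the synthesized code.
Assess semantic grounding by asking models to predict a program's output from its input, and analyze the limitations.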

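Proposed Method

Use dense, left-to-right, decoder-only Transformer language models trained on a large, broad corpus of web data that includes sources containing code.
Evaluate synthesis with few-shot prompting and with fine-tuning on task-specific datasets (MBPP: 374 fine-tuning examples; MathQA-Python: a larger fine-tuning set).
Generate multiple samples per problem with temperature sampling, then execute each sample against the test cases to verify functional correctness.
In the prompt-design experiments, vary the number and choice of examples in the prompt to assess prompt sensitivity and the potential benefit of prompt tuning.
Categorize errors (runtime errors, syntax errors, failed test assertions) and assess how model size affects the error distribution.
Account for overlap between the pre-training data and the prompts, and run adversarial-style checks to assess generalization beyond the prompt/test structure.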
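The execution-based check and the error categories described above can be illustrated with a minimal sketch. This is not the authors' harness: the function names `classify_sample` and `task_solved`, the toy `add` task, and the rule that a task counts as solved when any sample passes all tests are assumptions made for illustration. The sketch also runs code in-process with `exec`, which a real evaluation would replace with a sandboxed subprocess and a timeout.

```python
from collections import Counter
from typing import Sequence


def classify_sample(code: str, tests: Sequence[str]) -> str:
    """Run one generated program against assert-style tests.

    Returns one of: 'syntax_error', 'runtime_error',
    'assertion_failure', 'passed'.
    """
    # Syntax errors are detected at compile time, before anything runs.
    try:
        compiled = compile(code, "<sample>", "exec")
    except SyntaxError:
        return "syntax_error"

    namespace: dict = {}
    try:
        # Define the candidate function(s), then run each test statement.
        exec(compiled, namespace)
        for test in tests:
            exec(test, namespace)
    except AssertionError:
        return "assertion_failure"
    except Exception:
        # Any other exception (NameError, TypeError, ...) is a runtime error.
        return "runtime_error"
    return "passed"


def task_solved(samples: Sequence[str], tests: Sequence[str]) -> bool:
    """Assumed rule: a task is solved if any sample passes all tests."""
    return any(classify_sample(s, tests) == "passed" for s in samples)


# Hypothetical task and three hypothetical model samples.
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
samples = [
    "def add(a, b):\n    return a - b",  # wrong logic
    "def add(a, b)\n    return a + b",   # missing colon
    "def add(a, b):\n    return a + b",  # correct
]

outcomes = [classify_sample(s, tests) for s in samples]
outcome_counts = Counter(outcomes)  # error distribution over the samples
```

Tallying `outcome_counts` per model size is one way to see how the mix of syntax, runtime, and assertion failures shifts as models grow.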

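Experimental Results

Research Questions

RQ1: How well can large language models synthesize Python programs from natural language descriptions across MBPP and MathQA-Python?
RQ2: How does model size affect program synthesis performance in the few-shot and fine-tuned settings?
RQ3: Can dialog and human feedback meaningfully improve synthesis accuracy?
RQ4: To what extent do models semantically ground code, as measured by predicting a program's output for a given input?
RQ5: How robust are generated programs to adversarial or extended test cases beyond the tests shown in the prompt?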

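Key Findings

Synthesis performance scales log-linearly with model size. The largest model solves up to 59.6% of MBPP problems with few-shot prompting and reaches about 83.8% on MathQA-Python after fine-tuning.
Fine-tuning yields a gain of about 10 percentage points on MBPP across most model sizes, and fine-tuning on the larger MathQA-Python set yields larger gains.
Dialog-based human feedback roughly halves the error rate: four turns of dialog raise few-shot performance from about 30% to about 65% (an error rate of about 70% falling to about 35%).
Models tend to generalize to held-out test cases rather than merely repeating prompt content, although some failures occur in adversarial test scenarios. Overlap between the pre-training data and the MBPP test set is relatively minimal.
Even the best models show limited semantic grounding: they cannot reliably predict a program's output for a given input, which points to a gap between generating syntax and genuine understanding.
BLEU scores correlate poorly with synthesis success, and the sampling strategy (temperature) strongly affects performance. When the evaluation budget is tight, greedy decoding is more effective.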
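To make the temperature point concrete, here is a minimal sketch of temperature-scaled sampling over a toy logit vector. The functions `softmax_with_temperature` and `sample_token` and the logit values are hypothetical, not taken from the paper. Treating a temperature of 0 as greedy decoding (always picking the highest-scoring token) is a common convention assumed here.

```python
import math
import random
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy decoding for 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: Sequence[float], temperature: float, rng: random.Random) -> int:
    """Pick a token index: greedy if temperature is 0, otherwise sampled."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical logits for a three-token vocabulary.
toy_logits = [1.0, 3.0, 2.0]
rng = random.Random(0)

greedy_choice = sample_token(toy_logits, 0, rng)        # always the argmax
sharp = softmax_with_temperature(toy_logits, 0.1)       # mass concentrated on the argmax
flat = softmax_with_temperature(toy_logits, 1.0)        # more diverse samples
```

Lower temperatures concentrate probability on the most likely token, so samples are less diverse but individually more likely to be the model's best guess. That is consistent with greedy decoding being preferable when only a few samples can be evaluated, while higher temperatures pay off when many samples can be checked against tests.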


This review was generated by AI and checked by a human editor.