QUICK REVIEW

[Paper Review] CodeT: Code Generation with Generated Tests

Bei Chen, Fengji Zhang|arXiv (Cornell University)|Jul 21, 2022

Software Testing and Debugging Techniques|Cited by 64

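One-line summary

CodeT uses the same pre-trained language model that generates the code to automatically generate test cases, then applies dual execution agreement to select the best code solution from multiple samples. This improves pass@1 across multiple benchmarks and models.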

ABSTRACT

The task of generating code solutions for a given programming problem can benefit from the use of pre-trained language models such as Codex, which can produce multiple diverse samples. However, a major challenge for this task is to select the most appropriate solution from the multiple samples generated by the pre-trained language models. A natural way to evaluate the quality and correctness of a code solution is to run it against a set of test cases, but the manual creation of such test cases is often costly and time-consuming. In this paper, we propose a novel method, CodeT, that leverages the same pre-trained language models to automatically generate test cases for the code samples, thus reducing the human effort and increasing the coverage of the test scenarios. CodeT then executes the code samples using the generated test cases, and performs a dual execution agreement, which considers both the consistency of the outputs against the generated test cases and the agreement of the outputs with other code samples. We conduct comprehensive experiments on four benchmarks, HumanEval, MBPP, APPS and CodeContests, using five different pre-trained language models with varying sizes and capabilities. Our results show that CodeT can significantly improve the performance of code solution selection over previous methods, achieving remarkable and consistent gains across different models and benchmarks. For instance, CodeT improves the pass@1 metric on HumanEval to 65.8%, which represents an absolute improvement of 18.8% over the code-davinci-002 model, and an absolute improvement of more than 20% over the previous state-of-the-art results.

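Motivation and objectives

Reduce reliance on manually written test cases by generating them automatically with the same LM used for code generation.
Select better code solutions from multiple samples through execution-based agreement on the generated tests.
Make selection more robust and broaden test coverage by exploiting dual agreement: consistency with test results and consistency among solutions.
Demonstrate effectiveness in a zero-shot setting across multiple benchmarks and model families.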

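Proposed method

Prompt the same pre-trained LM used for code generation to output input-output pairs, producing test cases for each programming problem.
Sample a large set of code solutions X from the problem context with the LM, without requiring labeled data.
Apply RANSAC-inspired dual execution agreement to find consensus sets of (code, test) pairs: solutions that pass the same test cases and therefore agree with one another.
Rank each consensus set S by f(S) = |Sx| * |Sy|, where Sx is its set of code solutions and Sy is the set of test cases they pass, and pick the best code solution from the top-ranked set.
Optionally deduplicate solutions and compare performance with and without deduplication (the ablation shows little effect).
Evaluate pass@k in a zero-shot setting across multiple benchmarks and LM families, selecting solutions with the generated test cases rather than ground-truth labeled data.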
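The consensus-set ranking described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: solutions are modeled as plain Python callables, test cases as (input, expected output) tuples, and the names `passes` and `rank_consensus_sets` are hypothetical. Instead of sampling (code, test) pairs at random as a RANSAC-style procedure would, the sketch groups solutions by the exact set of tests they pass and scores each group with f(S) = |Sx| * |Sy|.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List, Tuple

Solution = Callable[[int], int]   # assumption: solutions are in-process callables
TestCase = Tuple[int, int]        # assumption: (input, expected output)


def passes(solution: Solution, test: TestCase) -> bool:
    """Return True if the solution yields the expected output without raising."""
    arg, expected = test
    try:
        return solution(arg) == expected
    except Exception:
        return False


def rank_consensus_sets(
    solutions: List[Solution], tests: List[TestCase]
) -> List[Tuple[int, List[int], FrozenSet[int]]]:
    """Group solutions by the exact set of tests they pass, then score each group.

    Returns (score, solution indices Sx, passed test indices Sy), best first.
    """
    groups: Dict[FrozenSet[int], List[int]] = defaultdict(list)
    for i, solution in enumerate(solutions):
        passed = frozenset(j for j, t in enumerate(tests) if passes(solution, t))
        if passed:  # a solution that passes no test forms no consensus set
            groups[passed].append(i)
    ranked = [(len(sx) * len(sy), sx, sy) for sy, sx in groups.items()]
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


# Toy problem (hypothetical): "return the square of n".
candidate_solutions: List[Solution] = [
    lambda n: n * n,   # correct
    lambda n: n ** 2,  # correct, written differently
    lambda n: n * 2,   # wrong
    lambda n: n + 1,   # wrong
]
generated_tests: List[TestCase] = [(2, 4), (3, 9), (0, 0), (1, 2)]  # last test is wrong

ranked = rank_consensus_sets(candidate_solutions, generated_tests)
best_score, best_solutions, best_tests = ranked[0]
print(best_score, best_solutions, sorted(best_tests))
```

In this toy run the two correct solutions agree on three tests, so their set scores 2 * 3 = 6 and outranks the wrong solutions even though one generated test is itself incorrect. A real pipeline would execute untrusted generated code in a sandbox with timeouts rather than calling it in-process.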

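Experimental results

Research questions

RQ1: How good are the quality and coverage of LM-generated test cases for driving code selection?
RQ2: Does dual execution agreement improve the selection of correct solutions across different models and benchmarks?
RQ3: How well does CodeT work in a zero-shot setting across diverse benchmarks and model sizes?
RQ4: How sensitive is CodeT to the number of generated test cases and to their quality (toxicity, accuracy, coverage)?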

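Key findings

CodeT markedly improves pass@1 across benchmarks and models. For example, with code-davinci-002 on HumanEval, pass@1 rises from 47.0% (baseline) to 65.8% (CodeT).
With code-davinci-002 on MBPP, pass@1 improves from 58.1% to 67.7%.
On APPS Introductory, pass@1 improves from 27.2% to 34.6%.
On CodeContests, CodeT improves the result from 0.7% to 2.1% (zero-shot).
CodeT achieves consistent gains across the Codex, InCoder, and CodeGen families and outperforms AlphaCode-C in all reported settings.
Test case quality (accuracy, toxicity, coverage) correlates with CodeT's gains: the higher-quality test cases from code-davinci-002 yield larger improvements.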
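The absolute gains behind these figures are plain differences in percentage points. The short Python sketch below recomputes them from the numbers quoted in this review; the variable and function names are illustrative, and no figures beyond those listed above are used.

```python
# (benchmark, baseline %, CodeT %) as quoted in the findings of this review
reported_results = [
    ("HumanEval", 47.0, 65.8),
    ("MBPP", 58.1, 67.7),
    ("APPS Introductory", 27.2, 34.6),
    ("CodeContests", 0.7, 2.1),
]


def absolute_gain(baseline: float, codet: float) -> float:
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round(codet - baseline, 1)


gains = {name: absolute_gain(base, codet) for name, base, codet in reported_results}
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The HumanEval difference reproduces the 18.8% absolute improvement stated in the abstract.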


This review was generated by AI and checked by a human editor.