QUICK REVIEW

[Paper Review] Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Jiawei Liu, Chunqiu Steven Xia|arXiv (Cornell University)|May 2, 2023

Software Engineering Research | Citations: 171

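One-Line Summary

This paper presents EvalPlus, an automated framework that uses large-scale, type-aware test inputs to rigorously evaluate the functional correctness of LLM-generated code, and shows that existing benchmarks underestimate how often that code is wrong.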

ABSTRACT

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies. While EvalPlus is general, we extend the test-cases of the popular HumanEval benchmark by 80x to build HumanEval+. Our extensive evaluation across 26 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up-to 19.3-28.9%. We also surprisingly found that test insufficiency can lead to mis-ranking. For example, both WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HumanEval+, while none of them could on HumanEval. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction to improve such programming benchmarks through automated testing. We have open-sourced our tools, enhanced datasets as well as all LLM-generated code at https://github.com/evalplus/evalplus to facilitate and accelerate future LLM-for-code research.

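Motivation and Objectives

Identify the limitations of existing code generation benchmarks in accurately measuring the correctness of LLM-generated code.
Develop EvalPlus, which automatically generates diverse, valid test inputs to strengthen benchmarks such as HumanEval and enable more rigorous evaluation.
Show how the strengthened test suite affects the pass@k metric across 26 LLMs, and reveal the resulting changes in model rankings.
Point out correctness problems in the ground-truth solutions of popular benchmarks, and propose improved evaluation through automated testing.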

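Proposed Method

Augment existing code benchmarks with test inputs generated automatically at scale, combining LLM-based seed generation with mutation-based diversification.
Use type-aware mutation to derive many valid inputs from high-quality seeds.
Use differential testing against the ground-truth implementation as the correctness oracle.
Apply set-cover-based test-suite reduction to cut the number of tests while preserving test effectiveness.
Annotate tasks with program contracts to filter out invalid inputs and guide test generation.
Re-implement and validate the ground-truth solutions to ensure the evaluation is reliable.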
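
The mutation, contract, and differential-testing steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the EvalPlus implementation: the function names (`mutate`, `generate_inputs`, `differential_test`), the specific mutation rules, and the example task (maximum of a non-empty list of integers) are all hypothetical and chosen only to show how the pieces fit together.

```python
import copy
import random


def mutate(value, rng):
    """Return a mutated value of the same type as `value` (hypothetical rules)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, float):
        return value + rng.uniform(-1.0, 1.0)
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]  # drop one character
        return value + rng.choice("abc")  # append one character
    if isinstance(value, list):
        if not value:
            return [0]  # assumption: an empty list grows by one int
        out = copy.deepcopy(value)
        i = rng.randrange(len(out))
        out[i] = mutate(out[i], rng)  # recurse into one element
        return out
    return value  # unsupported types are returned unchanged


def generate_inputs(seeds, contract, budget, rng):
    """Grow a pool of contract-satisfying inputs by mutating pool members."""
    pool = [s for s in seeds if contract(s)]
    if not pool:
        raise ValueError("at least one seed must satisfy the contract")
    for _ in range(budget):
        candidate = mutate(rng.choice(pool), rng)
        if contract(candidate) and candidate not in pool:
            pool.append(candidate)  # valid new inputs become future seeds
    return pool


def differential_test(candidate_fn, reference_fn, inputs):
    """Return the inputs on which the candidate disagrees with the reference."""
    failures = []
    for x in inputs:
        expected = reference_fn(copy.deepcopy(x))
        try:
            actual = candidate_fn(copy.deepcopy(x))
        except Exception:
            failures.append(x)  # a crash counts as a failure
            continue
        if actual != expected:
            failures.append(x)
    return failures


def non_empty_int_list(xs):
    """Hypothetical program contract for the example task."""
    return isinstance(xs, list) and len(xs) > 0 and all(isinstance(x, int) for x in xs)


def buggy_max(xs):
    """Hypothetical wrong solution: silently assumes non-negative values."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best
```

A caller would build a pool with `generate_inputs([[1, 2, 3]], non_empty_int_list, 200, random.Random(0))` and pass it to `differential_test(buggy_max, max, pool)`. A handful of hand-written positive-only tests would accept `buggy_max`, whereas any generated input whose elements are all negative exposes it, which is the kind of gap the review attributes to small test suites.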

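Experimental Results

Research Questions

RQ1: How adequate are current code generation benchmarks at detecting incorrect LLM-generated code?
RQ2: Does adding large-scale, diverse test cases (HumanEval+) change the measured performance (pass@k) of various LLMs relative to the original HumanEval?
RQ3: Can automated test input generation reveal ranking changes among LLMs that standard benchmarks do not show?
RQ4: How does test-suite reduction affect evaluation effectiveness and cost?
RQ5: Do the ground-truth solutions of popular benchmarks contain defects, and can automated testing expose them?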
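
The review does not define pass@k. The sketch below assumes the commonly used unbiased estimator introduced alongside the original HumanEval benchmark: with n samples per problem of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and the sample counts in the usage note are illustrative, not figures from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: number of samples generated for the problem
    c: number of those samples that pass every test
    k: number of samples the user is assumed to draw (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A stricter test suite can only keep or lower c for a fixed set of samples, so it can only keep or lower the estimate. For instance, with hypothetical counts, if 8 of 10 samples pass the original tests but only 6 pass the extended ones, `pass_at_k(10, 8, 1)` is higher than `pass_at_k(10, 6, 1)`.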

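Key Findings

Compared with the base HumanEval, HumanEval+ reduces pass@k by up to 19.3-28.9% across values of k for many models.
Evaluating on HumanEval+ changes relative rankings: several open-source models outperform ChatGPT on HumanEval+ but not on HumanEval.
EvalPlus exposed defects in HumanEval's ground-truth solutions (about 11% of problems), including logic errors and mishandled edge cases.
Test-suite reduction achieves similar effectiveness with about 47x fewer tests (HumanEval+-mini), depending on the reduction strategy used.
Lower temperatures work better for small k and higher temperatures for large k. The optimal temperature is relatively stable between HumanEval and HumanEval+, though it still varies for some models.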
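
The set-cover view of test-suite reduction mentioned in this review can be illustrated with the standard greedy approximation. This is a sketch under assumptions rather than the paper's procedure: the name `greedy_reduce` is hypothetical, and each test is assumed to come with a set of "requirements" it satisfies (for example, the wrong solutions it detects), which the reduced suite must still cover.

```python
def greedy_reduce(requirements_by_test):
    """Pick a small subset of tests that covers every requirement.

    requirements_by_test: dict mapping a test id to the set of requirement
    ids that test satisfies (hypothetical input format).
    Returns the selected test ids in the order they were chosen.
    """
    uncovered = set().union(*requirements_by_test.values())
    selected = []
    while uncovered:
        # Choose the test covering the most still-uncovered requirements;
        # iterating over sorted ids makes tie-breaking deterministic.
        best = max(
            sorted(requirements_by_test),
            key=lambda t: len(requirements_by_test[t] & uncovered),
        )
        selected.append(best)
        uncovered -= requirements_by_test[best]
    return selected
```

Greedy selection does not guarantee the smallest possible suite, but it keeps every requirement covered, which is what allows a much smaller suite to retain similar effectiveness. The choice of requirement (what a test must "cover") corresponds to the reduction strategy the finding above refers to.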


This review was generated by AI and checked by a human editor.