QUICK REVIEW

[Paper Review] Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Jiawei Liu, Chunqiu Steven Xia | arXiv (Cornell University) | May 2, 2023

Software Engineering Research | Cited by 171

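One-Sentence Summary

This paper proposes EvalPlus, an automated framework that augments code-generation benchmarks with large numbers of type-aware test inputs in order to rigorously evaluate the functional correctness of LLM-generated code, and shows that existing benchmarks substantially underestimate how often that code is wrong.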

ABSTRACT

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies. While EvalPlus is general, we extend the test-cases of the popular HumanEval benchmark by 80x to build HumanEval+. Our extensive evaluation across 26 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up-to 19.3-28.9%. We also surprisingly found that test insufficiency can lead to mis-ranking. For example, both WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HumanEval+, while none of them could on HumanEval. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction to improve such programming benchmarks through automated testing. We have open-sourced our tools, enhanced datasets as well as all LLM-generated code at https://github.com/evalplus/evalplus to facilitate and accelerate future LLM-for-code research.

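Motivation and Goals

Identify the limitations of existing code-generation benchmarks in accurately measuring the correctness of LLM-generated code.
Develop EvalPlus, which automatically generates diverse, valid test inputs and extends benchmarks such as HumanEval for more rigorous evaluation.
Show how the augmented test suite affects the pass@k scores of 26 LLMs and reveals changes in their ranking.
Highlight correctness problems in the ground-truth solutions of popular benchmarks, and propose an improved evaluation based on automated testing.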

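Proposed Method

Augment existing code benchmarks with large numbers of automatically generated test inputs, combining LLM-based seed generation with mutation-based diversification.
Use type-aware mutation to derive many valid inputs from high-quality seeds.
Use differential testing as the correctness oracle: the output of each LLM-generated solution is compared against that of the ground-truth implementation.
Reduce the test suite via set covering, preserving test effectiveness with fewer tests.
Annotate each task with program contracts to filter out invalid inputs and guide test generation.
Re-implement and validate the ground-truth solutions to ensure the evaluation is reliable.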
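
The interplay of seeds, type-aware mutation, contracts, and differential testing listed above can be illustrated with a minimal sketch. It is not the EvalPlus implementation: the task (detecting duplicates in a list), the contract (non-negative integers only), the deliberately buggy `candidate`, and every function name here are assumptions made up for illustration.

```python
import copy
import random


def mutate(value, rng):
    """Return a mutated copy of `value` that keeps its type (hypothetical rules)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, list):
        items = list(value)
        if not items:
            return items
        i = rng.randrange(len(items))
        if rng.random() < 0.5:
            items[i] = mutate(items[i], rng)      # change one element
        else:
            items.append(mutate(items[i], rng))   # grow the list
        return items
    return value  # other types are left unchanged in this sketch


def contract(xs):
    """Assumed precondition of the task: a list of non-negative ints."""
    return isinstance(xs, list) and all(
        isinstance(x, int) and not isinstance(x, bool) and x >= 0 for x in xs
    )


def ground_truth(xs):
    """Reference solution: does the list contain a duplicate?"""
    return len(set(xs)) != len(xs)


def candidate(xs):
    """Deliberately buggy solution: only compares neighbouring elements."""
    return any(a == b for a, b in zip(xs, xs[1:]))


def generate_inputs(seeds, contract, budget, rng):
    """Grow a pool of valid inputs by mutating pool members `budget` times."""
    pool = [s for s in seeds if contract(s)]
    for _ in range(budget):
        new = mutate(rng.choice(pool), rng)
        if contract(new) and new not in pool:  # the contract filters invalid inputs
            pool.append(new)
    return pool


def differential_test(candidate, ground_truth, inputs):
    """Return the inputs on which the candidate disagrees with the ground truth."""
    failing = []
    for x in inputs:
        expected = ground_truth(copy.deepcopy(x))
        try:
            ok = candidate(copy.deepcopy(x)) == expected
        except Exception:
            ok = False  # a crash counts as a failure
        if not ok:
            failing.append(x)
    return failing


if __name__ == "__main__":
    rng = random.Random(0)
    pool = generate_inputs([[1, 2, 3]], contract, budget=200, rng=rng)
    print(len(pool), "valid inputs;", len(differential_test(candidate, ground_truth, pool)), "expose the bug")
```

The seed `[1, 2, 3]` alone cannot tell the two implementations apart; only mutated inputs with a non-adjacent duplicate, such as `[1, 2, 1]`, expose the difference, which is the kind of gap the summary attributes to small hand-written test suites.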

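Experimental Results

Research Questions

RQ1: How adequate are current code-generation benchmarks at detecting incorrect LLM-generated code?
RQ2: Does adding large-scale, diverse test cases (HumanEval+) change the measured performance (pass@k) of a range of LLMs relative to the original HumanEval?
RQ3: Can automated test-input generation reveal changes in LLM ranking that are invisible under the standard benchmark?
RQ4: How does test-suite reduction affect evaluation effectiveness and cost?
RQ5: Are the ground-truth solutions in popular benchmarks flawed, and can automated testing expose those flaws?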
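
The pass@k metric in RQ2 is not defined in this summary. The sketch below assumes it refers to the standard unbiased estimator introduced with HumanEval, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass every test. The function name and the sample counts in the usage lines are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given that c of the n generated samples passed all tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: stricter tests reject two more samples (6 -> 4 of 10).
    print(pass_at_k(10, 6, 1), pass_at_k(10, 4, 1))
```

Stricter tests can only lower c for a fixed set of samples, so under this estimator pass@k on an augmented suite can never exceed pass@k on the original suite.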

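Key Findings

Compared with the base HumanEval, pass@k on HumanEval+ drops markedly for many models (by up to 19.3-28.9% across different values of k).
Evaluating on HumanEval+ changes the relative ranking: some open-source models (WizardCoder-CodeLlama and Phind-CodeLlama) outperform ChatGPT on HumanEval+ but not on HumanEval.
EvalPlus reveals defects in the HumanEval ground-truth solutions (about 11% of the problems), including logic errors and mishandled edge cases.
Test-suite reduction yields a suite with about 47x fewer tests (HumanEval+ mini) that remains similarly effective under the reduction strategies used.
Lower temperatures favor smaller k and higher temperatures favor larger k; the best temperature is fairly stable before and after switching to HumanEval+, although it shifts slightly for some models.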
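
The set-cover reduction behind the smaller suite can be sketched with the classic greedy heuristic. This is an assumption-laden illustration rather than the paper's procedure: the `detects` mapping (test id to the set of items that test covers, for example the wrong samples it catches) and the function name are hypothetical.

```python
def reduce_tests(detects):
    """Greedy set cover: keep picking the test that covers the most
    still-uncovered items until everything is covered."""
    uncovered = set().union(*detects.values())
    kept = []
    while uncovered:
        # Ties go to the test that appears first in the mapping.
        best = max(detects, key=lambda t: len(detects[t] & uncovered))
        kept.append(best)
        uncovered -= detects[best]
    return kept


if __name__ == "__main__":
    # Hypothetical data: which wrong samples each test catches.
    detects = {"t1": {"s1", "s2"}, "t2": {"s2", "s3"}, "t3": {"s3"}}
    print(reduce_tests(detects))
```

The greedy choice does not guarantee a minimum-size suite, but the reduced suite covers everything the full suite covers with respect to the chosen criterion.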


This review was generated by AI and reviewed by human editors.