QUICK REVIEW

[Paper Review] What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Shihan Dou, Haoxiang Jia|arXiv (Cornell University)|Jul 8, 2024

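One-Sentence Summary

This paper empirically analyzes code generated by seven LLMs across several benchmarks, builds a bug taxonomy, creates a real-world benchmark, and proposes a self-critique method that fixes bugs without fine-tuning.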

ABSTRACT

The increasing development of LLMs in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of existing methods. To bridge this gap, we conducted an extensive empirical study evaluating the performance of three leading closed-source LLMs and four popular open-source LLMs on three commonly used benchmarks. Our investigation, which evaluated the length, cyclomatic complexity and API number of the generated code, revealed that these LLMs face challenges in generating successful code for more complex problems, and tend to produce code that is shorter yet more complicated as compared to canonical solutions. Additionally, we developed a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyzed the root cause for common bug types. To better understand the performance of LLMs in real-world projects, we also manually created a real-world benchmark RWPB. We analyzed bugs on RWPB to highlight distinct differences in bug distributions between actual scenarios and existing benchmarks. Finally, we propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback. Our comprehensive and extensive study provides insights into the current limitations of LLM-based code generation and opportunities for enhancing the accuracy and quality of the generated code.

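Research Motivation and Goals

Evaluate the accuracy and characteristics of code generated by leading closed-source and open-source LLMs on Python tasks.
Characterize the types of bugs in generated code and how they are distributed.
Compare performance on standard benchmarks with performance on a real-world, manually curated benchmark (RWPB).
Propose a training-free self-critique method to reduce bugs and raise the pass rate.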

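Proposed Method

Evaluate seven LLMs (3 closed-source, 4 open-source) on 1,164 programming problems from HumanEval+, MBPP+, and APPS+.
Measure the length, cyclomatic complexity, and API usage of the generated code and compare them with the canonical solutions.
Develop a two-stage bug annotation process (a script-based initial taxonomy followed by manual refinement) that classifies bugs into 3 main types and 12 subtypes.
Build a real-world benchmark (RWPB) from 140 GitHub tasks to compare real-world bug distributions with those on the standard benchmarks.
Introduce an iterative self-critique method in which LLMs critique and correct their own code based on the bug taxonomy and compiler feedback, with no additional training.
Report the pass-rate improvement and analyze how task complexity affects LLM performance.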
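
The three code characteristics measured above (length, cyclomatic complexity, API usage) can be approximated with Python's standard `ast` module. The sketch below is only an illustration, not the study's actual tooling: the function name `code_metrics`, the set of node types counted as decision points, and the choice to count every call expression as one "API call" are all assumptions made here.

```python
import ast

# Assumption: these node types are treated as decision points when
# approximating cyclomatic complexity. Real tools may count differently.
_BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.AsyncFor,
    ast.ExceptHandler, ast.IfExp, ast.Assert, ast.comprehension,
)


def code_metrics(source: str) -> dict:
    """Return rough length, complexity, and call-count metrics for Python source."""
    tree = ast.parse(source)

    # Length: non-blank lines that are not pure comment lines.
    loc = sum(
        1 for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )

    complexity = 1  # one path through straight-line code
    api_calls = 0
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra decision points
            complexity += len(node.values) - 1
        elif isinstance(node, ast.Call):
            # Assumption: every call expression counts as one API use.
            api_calls += 1

    return {"loc": loc, "cyclomatic": complexity, "api_calls": api_calls}


canonical = """
def count_even_pos(xs):
    total = 0
    for x in xs:
        if x > 0:
            if x % 2 == 0:
                total += 1
    return total
"""

generated = """
def count_even_pos(xs):
    return len([x for x in xs if x > 0 and x % 2 == 0])
"""

print(code_metrics(canonical))
print(code_metrics(generated))
```

Running both snippets through the same function gives a like-for-like comparison between a generated solution and its canonical counterpart, which is the kind of comparison the study describes.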

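Experimental Results

Research Questions

RQ1: How effective are LLMs at code generation, and how does task complexity affect performance?
RQ2: What are the root causes and the distribution of bugs in LLM-generated code on the benchmarks?
RQ3: How can a real-world benchmark be built to minimize data leakage, and how do real-world bugs compare with benchmark bugs?
RQ4: Can a training-free self-critique method mitigate bugs and improve the correctness of generated code?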

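Key Findings

Closed-source LLMs outperform open-source LLMs, especially on complex tasks (GPT-4 and Claude-3 perform best; Phi-3 lags behind).
Generated code tends to be shorter but has higher cyclomatic complexity, while its API usage is similar to that of the canonical solutions.
Incorrect code tends to contain more comments than correct code, suggesting that comments correlate with complexity rather than correctness.
Functional bugs are the dominant problem, although syntax and runtime bugs also occur; complex problems lead to timeouts or suboptimal algorithms.
On the real-world benchmark RWPB, Claude-3 reaches 45.7% accuracy and Phi-3 reaches 22%, and the bug distribution differs from that on the standard benchmarks.
The self-critique method raises the pass rate by 29.2% after two iterations, with no additional training.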
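
The iterative self-critique idea can be sketched as a short loop: run the code, collect feedback, ask the model to critique it, then ask it to repair it. This is a minimal sketch under stated assumptions, not the paper's implementation. The `critique` and `repair` callables are hypothetical stand-ins for LLM calls, `run_tests` and `self_critique_repair` are names invented here, and using test execution as the feedback signal is an assumption (the summary only mentions bug types and compiler feedback).

```python
from typing import Callable, Optional, Tuple


def run_tests(code: str, test_code: str) -> Optional[str]:
    """Return None if the code passes the tests, else a short error message.

    Warning: exec() on untrusted model output is unsafe; a real setup
    would run this inside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(compile(code, "<generated>", "exec"), namespace)
        exec(compile(test_code, "<tests>", "exec"), namespace)
    except Exception as exc:  # covers SyntaxError, AssertionError, runtime errors
        return f"{type(exc).__name__}: {exc}"
    return None


def self_critique_repair(
    task: str,
    code: str,
    test_code: str,
    critique: Callable[[str, str, str], str],  # hypothetical LLM call
    repair: Callable[[str, str, str], str],    # hypothetical LLM call
    max_iters: int = 2,
) -> Tuple[str, bool]:
    """Critique and repair the code for up to max_iters rounds, with no training."""
    for _ in range(max_iters):
        feedback = run_tests(code, test_code)
        if feedback is None:
            return code, True
        # The model first names the likely bug type, then rewrites the code.
        bug_report = critique(task, code, feedback)
        code = repair(task, code, bug_report)
    return code, run_tests(code, test_code) is None


# Stubbed "model" for demonstration only.
buggy = "def add(a, b):\n    return a - b\n"
fixed = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\n"

final_code, passed = self_critique_repair(
    "add two numbers",
    buggy,
    tests,
    critique=lambda task, code, fb: f"functional bug suspected ({fb})",
    repair=lambda task, code, report: fixed,
)
print(passed)
```

The default of `max_iters=2` mirrors the two iterations mentioned in the findings; everything else about the prompts and feedback format is left open here.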

This review was generated by AI and reviewed by a human editor.