QUICK REVIEW

[Paper Review] A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair

Quanjun Zhang, Tongke Zhang|arXiv (Cornell University)|Oct 13, 2023

Software Engineering Research | Cited by 31

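One-Sentence Summary

This paper builds EvalGPTFix to evaluate ChatGPT's automated program repair ability on unseen Java bugs from AtCoder. With the basic prompt, ChatGPT fixes 109 of 151 bugs; with improved prompts and dialogue it fixes up to 143, outperforming CodeT5 and PLBART. The paper also analyzes prompt design, dialogue-based repair, and the data leakage problem of black-box large language models in software engineering.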

ABSTRACT

Large Language Models (LLMs) have been gaining increasing attention and demonstrated promising performance across a variety of Software Engineering (SE) tasks, such as Automated Program Repair (APR), code summarization, and code completion. For example, ChatGPT, the latest black-box LLM, has been investigated by numerous recent research studies and has shown impressive performance in various tasks. However, there exists a potential risk of data leakage since these LLMs are usually close-sourced with unknown specific training details, e.g., pre-training datasets. In this paper, we seek to review the bug-fixing capabilities of ChatGPT on a clean APR benchmark with different research objectives. We first introduce EvalGPTFix, a new benchmark with buggy and the corresponding fixed programs from competitive programming problems starting from 2023, after the training cutoff point of ChatGPT. The results on EvalGPTFix show that ChatGPT is able to fix 109 out of 151 buggy programs using the basic prompt within 35 independent rounds, outperforming state-of-the-art LLMs CodeT5 and PLBART by 27.5% and 62.4% prediction accuracy. We also investigate the impact of three types of prompts, i.e., problem description, error feedback, and bug localization, leading to additional 34 fixed bugs. Besides, we provide additional discussion from the interactive nature of ChatGPT to illustrate the capacity of a dialog-based repair workflow with 9 additional fixed bugs. Inspired by the findings, we further pinpoint various challenges and opportunities for advanced SE study equipped with such LLMs (e.g., ChatGPT) in the near future. More importantly, our work calls for more research on the reevaluation of the achievements obtained by existing black-box LLMs across various SE tasks, not limited to ChatGPT on APR.

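Research Motivation and Objectives

Evaluate ChatGPT's repair effectiveness on a clean, unseen APR benchmark (EvalGPTFix).
Study how different prompts (problem description, error feedback, bug localization) affect repair performance.
Investigate whether dialogue-based interaction improves iterative bug fixing with ChatGPT.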

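Proposed Method

Build EvalGPTFix: 151 buggy/correct Java program pairs from AtCoder contests (2023), using test-case-based validation and static/dynamic filtering to ensure the data is unseen.
Use ChatGPT (gpt-3.5-turbo) to repair each bug through repeated prompting, for up to 35 rounds; stop once three consecutive rounds produce no new fixes.
Compare against state-of-the-art LLMs as baselines by fine-tuning CodeT5 and PLBART on FixEval data and evaluating their patches on AtCoder test cases.
Assess prompt effects by adding (a) the problem description, (b) the error message, (c) the bug location, and (d) interactive dialogue, measuring the number of additionally fixed bugs.
Report recall, fix rates by bug type, and overlap across models to assess relative strengths.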
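
The repeated-prompting procedure with its stopping rule can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate_patch` and `passes_tests` are hypothetical stand-ins for the ChatGPT call and the AtCoder test-case check, and skipping bugs that are already fixed is an assumption made here to keep the loop short.

```python
from typing import Callable, Dict, Set


def run_repair_rounds(
    bugs: Dict[str, str],
    generate_patch: Callable[[str], str],      # hypothetical: wraps the model call
    passes_tests: Callable[[str, str], bool],  # hypothetical: runs the test cases
    max_rounds: int = 35,
    patience: int = 3,
) -> Set[str]:
    """Return the ids of bugs fixed in at least one round."""
    fixed: Set[str] = set()
    stale_rounds = 0  # consecutive rounds that produced no new fix
    for _ in range(max_rounds):
        newly_fixed: Set[str] = set()
        for bug_id, buggy_code in bugs.items():
            if bug_id in fixed:
                continue  # assumption: do not re-query bugs that are already fixed
            patch = generate_patch(buggy_code)
            if passes_tests(bug_id, patch):
                newly_fixed.add(bug_id)
        fixed |= newly_fixed
        stale_rounds = 0 if newly_fixed else stale_rounds + 1
        if stale_rounds >= patience:
            break  # three consecutive rounds without a new fix
    return fixed
```

Because the model's output is random, a bug counts as fixed if any round yields a patch that passes the tests; the patience counter is what ends the run before the 35-round cap.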

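Experimental Results

Research Questions

RQ1: How effectively does ChatGPT repair the buggy programs in EvalGPTFix?
RQ2: How do different prompts affect ChatGPT's repair performance?
RQ3: Can dialogue-based interaction further improve ChatGPT's repair results?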
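
To make the prompt variants behind RQ2 concrete, here is a small sketch of how a basic prompt might be extended with the three optional hints. The wording of every template string and the function name `build_prompt` are hypothetical; the paper's actual prompt text is not reproduced in this review.

```python
from typing import Optional

# Hypothetical wording; not the prompt used in the paper.
BASIC_INSTRUCTION = "The following Java program is buggy. Please provide a fixed version."


def build_prompt(
    buggy_code: str,
    problem_description: Optional[str] = None,
    error_message: Optional[str] = None,
    buggy_line: Optional[int] = None,
) -> str:
    """Assemble a repair prompt, adding only the hints that are supplied."""
    parts = [BASIC_INSTRUCTION]
    if problem_description is not None:
        parts.append("Problem description:\n" + problem_description)
    if error_message is not None:
        parts.append("Error message:\n" + error_message)
    if buggy_line is not None:
        parts.append(f"The bug is located at line {buggy_line}.")
    parts.append("Buggy program:\n" + buggy_code)
    return "\n\n".join(parts)
```

Calling `build_prompt(code)` gives the basic prompt, while passing one extra argument at a time isolates the effect of each hint, which mirrors how the additional fixes are counted per prompt type.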

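Key Findings

In EvalGPTFix, ChatGPT fixes 109 of 151 bugs with the basic prompt.
Adding the problem description, error message, and bug location fixes a further 18, 25, and 10 bugs, respectively.
Dialogue interaction yields 9 additional fixes beyond the prompt-based attempts.
Overall, ChatGPT fixes 143 bugs in EvalGPTFix, indicating strong potential for repairing real-world buggy programs.
CodeT5 fixes 79 bugs and PLBART fixes 41, so ChatGPT's repair capability exceeds both in this study.
ChatGPT's output is noticeably random, requiring multiple rounds (up to 35) for results to stabilize.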
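
The counts above can be turned into rates with a few lines of arithmetic. One observation, offered as a reading of the numbers rather than a definition stated in this review: the 27.5% and 62.4% margins quoted in the abstract match the gap measured relative to ChatGPT's own count of 109. The helper names below are illustrative.

```python
TOTAL_BUGS = 151
FIXED = {"ChatGPT (basic prompt)": 109, "CodeT5": 79, "PLBART": 41}


def fix_rate(fixed: int, total: int = TOTAL_BUGS) -> float:
    """Percentage of benchmark bugs fixed."""
    return 100 * fixed / total


def relative_gap(best: int, other: int) -> float:
    """Gap between two fix counts, as a percentage of the larger count."""
    return 100 * (best - other) / best


for name, count in FIXED.items():
    print(f"{name}: {fix_rate(count):.1f}% of {TOTAL_BUGS}")

chatgpt = FIXED["ChatGPT (basic prompt)"]
print(f"Gap vs CodeT5: {relative_gap(chatgpt, FIXED['CodeT5']):.1f}%")  # 27.5%
print(f"Gap vs PLBART: {relative_gap(chatgpt, FIXED['PLBART']):.1f}%")  # 62.4%
```

Measured against all 151 bugs, the fix rates work out to about 72.2% for ChatGPT with the basic prompt, 52.3% for CodeT5, and 27.2% for PLBART.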


This review was generated by AI and checked by human editors.