QUICK REVIEW

[论文解读] Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study

Qi Guo, Junming Cao|arXiv (Cornell University)|Sep 15, 2023

Software Engineering Research被引用 9

一句话总结

本文基于评审意见对代码 refinement 的 ChatGPT 进行实证评估，显示在新数据集上的泛化能力优于 CodeReviewer，但仍有提升空间，并分析提示词、温度及错误根本原因。

ABSTRACT

Code review is an essential activity for ensuring the quality and maintainability of software projects. However, it is a time-consuming and often error-prone task that can significantly impact the development process. Recently, ChatGPT, a cutting-edge language model, has demonstrated impressive performance in various natural language processing tasks, suggesting its potential to automate code review processes. However, it is still unclear how well ChatGPT performs in code review tasks. To fill this gap, in this paper, we conduct the first empirical study to understand the capabilities of ChatGPT in code review tasks, specifically focusing on automated code refinement based on given code reviews. To conduct the study, we select the existing benchmark CodeReview and construct a new code review dataset with high quality. We use CodeReviewer, a state-of-the-art code review tool, as a baseline for comparison with ChatGPT. Our results show that ChatGPT outperforms CodeReviewer in code refinement tasks. Specifically, our results show that ChatGPT achieves higher EM and BLEU scores of 22.78 and 76.44 respectively, while the state-of-the-art method achieves only 15.50 and 62.88 on a high-quality code review dataset. We further identify the root causes for ChatGPT's underperformance and propose several strategies to mitigate these challenges. Our study provides insights into the potential of ChatGPT in automating the code review process, and highlights the potential research directions.

研究动机与目标

评估 ChatGPT 根据评审意见改进代码的能力，与现有方法 CodeReviewer 进行比较。
确定能够最大化 ChatGPT 性能的设置（提示词与温度）。
检视 ChatGPT 表现不佳的情形并揭示根本原因。
提出缓解策略并为 generalization 评估创建高质量 CodeReview-New 数据集。

提出的方法

以 CodeReview 作为基线并构建 CodeReview-New 以测试泛化能力。
在多个数据集上对比 CodeReviewer，使用零-shot 的 GPT-3.5-Turbo 评估 ChatGPT。
系统性地改变提示词与温度，评估对 EM 和 BLEU 指标的影响。
引入 EM-trim 与 BLEU-trim 以更好地反映实际的代码 refinement 结果。
对失败案例进行定性分析以识别根本原因与缓解策略。
在不佳案例中可选比较 GPT-4 (GPT-4) 的表现（RQ4）。

实验结果

研究问题

RQ1RQ1: 不同的 ChatGPT 提示词和温度设置如何影响代码 refinement 的性能？
RQ2RQ2: ChatGPT 相对于现有最先进的 CodeReviewer 在标准与新数据集上的表现如何？
RQ3RQ3: 在哪些情形下 ChatGPT 表现良好或较差，出现了哪些定性模式？
RQ4RQ4: ChatGPT 表现不佳的根本原因是什么，如何减缓？

主要发现

数据集	工具	样本数	EM	EM-T	BLEU	BLEU-T
CR	CodeReviewer	13,104	32.49	32.55	83.39	83.50
CR	ChatGPT	16.70	19.47	68.26	75.12
CRN	CodeReviewer	14,568	14.84	15.50	62.25	62.88
CRN	ChatGPT	19.52	22.78	72.56	76.44
CRNT	CodeReviewer	9,117	15.75	16.31	62.01	62.47
CRNT	ChatGPT	19.60	22.44	72.90	76.55
CRNL	CodeReviewer	5,451	13.21	14.05	62.67	63.61
CRNL	ChatGPT	19.39	23.40	71.97	76.25

较低的温度（0）和简洁的提示词通常能产生最强且最稳定的结果。
带有情景描述的提示变体（P2、P5）优于其他，而过于详细的提示（P3）可能削弱性能。
在 CodeReview-New 数据集上，ChatGPT 在 EM-T 和 BLEU-T 上优于 CodeReviewer（EM-T：22.78 对 15.50；BLEU-T：76.44 对 62.88）。
ChatGPT 在 CodeReview-New 上的 EM 与 BLEU 高于 CodeReview，显示对未见语言和评审的更好泛化。
CodeReview（CR）上的 EM 为 16.70（EM）、19.47（EM-T）；BLEU 为 68.26，BLEU-T 为 75.12，表明在现有基准上仍有改进空间。
根本原因分析指出对评审的理解、过度删除、额外修改以及真实标注的歧义是关键的失败模式。

Figure 2. Construction strategies of Prompt 1 and Prompt 2

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。