QUICK REVIEW

[Paper Review] GitHub Copilot AI pair programmer: Asset or Liability?

Arghavan Moradi Dakhel, Vahid Majdinasab|arXiv (Cornell University)|Jun 30, 2022

Software Engineering Research | Cited by 40

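One-Sentence Summary

This paper empirically evaluates GitHub Copilot as an AI pair programmer by testing its ability to solve fundamental algorithmic problems and by comparing its solutions with those of human programmers on Python tasks.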

ABSTRACT

Automatic program synthesis is a long-lasting dream in software engineering. Recently, a promising Deep Learning (DL) based solution, called Copilot, has been proposed by OpenAI and Microsoft as an industrial product. Although some studies evaluate the correctness of Copilot solutions and report its issues, more empirical evaluations are necessary to understand how developers can benefit from it effectively. In this paper, we study the capabilities of Copilot in two different programming tasks: (i) generating (and reproducing) correct and efficient solutions for fundamental algorithmic problems, and (ii) comparing Copilot's proposed solutions with those of human programmers on a set of programming tasks. For the former, we assess the performance and functionality of Copilot in solving selected fundamental problems in computer science, like sorting and implementing data structures. In the latter, a dataset of programming problems with human-provided solutions is used. The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems, however, some solutions are buggy and non-reproducible. Moreover, Copilot has some difficulties in combining multiple methods to generate a solution. Comparing Copilot to humans, our results show that the correct ratio of humans' solutions is greater than Copilot's suggestions, while the buggy solutions generated by Copilot require less effort to be repaired.

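Motivation and Objectives

Assess whether Copilot can generate correct and efficient solutions to fundamental algorithmic problems.
Compare Copilot-generated solutions with human solutions on a dataset of Python programming tasks.
Evaluate metrics including correctness, efficiency, reproducibility, and the similarity of Copilot's outputs.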

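Proposed Method

Test fundamental algorithmic problems covering sorting, data structures, graphs, and greedy algorithms, using prompts drawn from a standard algorithm design textbook.
Evaluate multiple Copilot responses per prompt in two trials 30 days apart to assess consistency.
Manually assess correctness by comparing against the reference algorithms and running unit tests, and check fidelity to the requested algorithm.
Measure code optimality by checking whether at least one correct solution uses the optimal algorithm.
Use AST-based similarity to assess the reproducibility and similarity of code across attempts and trials.
Compare Copilot's performance with human solutions using a dataset of Python programming tasks from a course, which includes correct submissions, buggy submissions, and a repair tool.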
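
As an illustration of what an AST-based similarity score can look like, here is a minimal sketch. This summary does not say which tool or formula the authors used, so the approach below (comparing sequences of AST node types with Python's standard `ast` and `difflib` modules) is an assumption, and the names `node_types` and `ast_similarity` are hypothetical.

```python
import ast
from difflib import SequenceMatcher


def node_types(source: str) -> list[str]:
    """Return the AST node type names of `source` in traversal order."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def ast_similarity(source_a: str, source_b: str) -> float:
    """Similarity in [0, 1] between the node-type sequences of two programs.

    Assumption: this is one simple way to score structural similarity,
    not necessarily the measure used in the paper.
    """
    matcher = SequenceMatcher(None, node_types(source_a), node_types(source_b))
    return matcher.ratio()


# Two suggestions that differ only in identifier names, and one that
# solves the same task with a different structure.
first = "def add(a, b):\n    return a + b\n"
renamed = "def plus(x, y):\n    return x + y\n"
different = (
    "def add(a, b):\n"
    "    total = 0\n"
    "    for v in (a, b):\n"
    "        total += v\n"
    "    return total\n"
)

print(ast_similarity(first, renamed))    # identical structure
print(ast_similarity(first, different))  # structurally different
```

Because only node types are compared, renaming variables does not change the score, which is the property that makes a structural measure more useful than plain text comparison when judging whether two suggestions are really the same solution.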

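Experimental Results

Research Questions

RQ1: Can Copilot propose correct and efficient solutions to fundamental algorithmic problems?
RQ2: Are Copilot's solutions competitive with human solutions for solving programming problems?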

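Key Findings

Copilot can generate solutions for most fundamental problems, but some solutions are buggy and non-reproducible.
Copilot has difficulty combining multiple methods into a complete solution.
Compared with humans, Copilot has a lower correct-solution rate and lower solution diversity.
Buggy Copilot solutions are easier to repair, but novices may find them harder to screen.
When used by experienced developers, Copilot's solution quality is comparable to that of humans, making it an asset; when used by novices, it may become a liability because of buggy or non-optimal code.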


This review was generated by AI and reviewed by a human editor.