QUICK REVIEW

[Paper Digest] Generation Probabilities Are Not Enough: Uncertainty Highlighting in AI Code Completions

M. Helena Vasconcelos, Gagan Bansal|arXiv (Cornell University)|Feb 14, 2023

Software Engineering Research | Cited by 17

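ONE-SENTENCE SUMMARY

This paper compares two uncertainty-highlighting approaches for AI-assisted code completion and finds that highlighting the tokens most likely to be edited, rather than those with low generation probability, speeds up work and yields more targeted edits, whereas generation-probability highlighting shows no clear benefit.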

ABSTRACT

Large-scale generative models enabled the development of AI-powered code completion tools to assist programmers in writing code. However, much like other AI-powered tools, AI-powered code completions are not always accurate, potentially introducing bugs or even security vulnerabilities into code if not properly detected and corrected by a human programmer. One technique that has been proposed and implemented to help programmers identify potential errors is to highlight uncertain tokens. However, there have been no empirical studies exploring the effectiveness of this technique -- nor investigating the different and not-yet-agreed-upon notions of uncertainty in the context of generative models. We explore the question of whether conveying information about uncertainty enables programmers to more quickly and accurately produce code when collaborating with an AI-powered code completion tool, and if so, what measure of uncertainty best fits programmers' needs. Through a mixed-methods study with 30 programmers, we compare three conditions: providing the AI system's code completion alone, highlighting tokens with the lowest likelihood of being generated by the underlying generative model, and highlighting tokens with the highest predicted likelihood of being edited by a programmer. We find that highlighting tokens with the highest predicted likelihood of being edited leads to faster task completion and more targeted edits, and is subjectively preferred by study participants. In contrast, highlighting tokens according to their probability of being generated does not provide any benefit over the baseline with no highlighting. We further explore the design space of how to convey uncertainty in AI-powered code completion tools, and find that programmers prefer highlights that are granular, informative, interpretable, and not overwhelming.

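RESEARCH MOTIVATION AND OBJECTIVES

Understand how uncertainty highlighting affects programmer performance when using AI-powered code completion.
Compare two notions of uncertainty (generation probability and edit likelihood) across multiple coding tasks.
Identify design preferences for conveying uncertainty to programmers.
Investigate the subjective utility and cognitive load of different uncertainty-highlighting schemes.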

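PROPOSED METHOD

Conduct a within-subjects mixed-methods study in which 30 programmers complete three coding tasks.
Implement two uncertainty-highlighting conditions: generation-probability highlighting (threshold of roughly 69.4%) and edit-model highlighting (tokens edited by at least 4 of 6 participants).
Train a closed-world edit model that predicts how likely a token is to be edited, based on participants' edits to Codex-generated completions.
Compare both conditions against a no-highlighting baseline on multiple performance and subjective measures.
Preregister and analyze nine hypotheses concerning time, accuracy, token survival, cognitive load, and perceived utility.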
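
The two highlighting rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Token fields, the function names, and the sample values are hypothetical, and it assumes that the generation-probability condition highlights tokens whose probability falls below the quoted threshold, while the edit condition highlights tokens edited by at least 4 of 6 participants.

```python
from dataclasses import dataclass

# Figures quoted in this digest; exactly how the study applied them is an assumption.
GEN_PROB_THRESHOLD = 0.694  # highlight tokens with generation probability below this
MIN_EDITORS = 4             # highlight tokens edited by at least this many participants


@dataclass
class Token:
    text: str
    gen_prob: float  # hypothetical: model probability assigned to this token
    edited_by: int   # hypothetical: how many of 6 participants edited this token


def highlight_by_generation_probability(tokens, threshold=GEN_PROB_THRESHOLD):
    """Flag tokens the generative model was least sure about."""
    return [t.gen_prob < threshold for t in tokens]


def highlight_by_edit_likelihood(tokens, min_editors=MIN_EDITORS):
    """Flag tokens that enough participants actually edited."""
    return [t.edited_by >= min_editors for t in tokens]


def render(tokens, mask):
    """Show highlighted tokens in square brackets."""
    return " ".join(f"[{t.text}]" if m else t.text for t, m in zip(tokens, mask))


# Made-up example data, not taken from the study.
tokens = [
    Token("return", 0.98, 0),
    Token("max", 0.55, 1),
    Token("(", 0.99, 0),
    Token("xs", 0.91, 5),
    Token(")", 0.97, 0),
]

print(render(tokens, highlight_by_generation_probability(tokens)))
print(render(tokens, highlight_by_edit_likelihood(tokens)))
```

In this made-up example the two rules flag different tokens: the model is unsure about `max`, but people mostly edit `xs`. That kind of mismatch is what the study's comparison of the two conditions probes.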

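EXPERIMENTAL RESULTS

Research Questions

RQ1: Does uncertainty highlighting improve task completion time and accuracy in AI-assisted coding?
RQ2: Which notion of uncertainty (generation probability vs. edit likelihood) benefits programmers more?
RQ3: What design preferences do programmers have for uncertainty highlighting in code completions?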

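Key Findings

Highlighting the tokens predicted to be most likely edited leads to faster task completion and more targeted edits.
Generation-probability highlighting yields no performance improvement over the no-highlighting baseline.
Edit-model highlighting increased the likelihood that participants edited the highlighted tokens, and participants subjectively preferred it.
Programmers prefer uncertainty highlights that are granular, informative, interpretable, and not overwhelming, and they prefer shading over exact probabilities.
There is some evidence that accuracy improved across tasks with edit-model highlighting, although the improvement was not always statistically significant because of the sample size.
A simple closed-world edit model trained on program-edit data can serve as a viable proxy for uncertainty in code generation.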

This digest was generated by AI and reviewed by human editors.