QUICK REVIEW

[Paper Review] Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming

Hussein Mozannar, Gagan Bansal | arXiv (Cornell University) | Oct 25, 2022

Software Engineering Research | Cited by 21

One-Sentence Summary

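This paper introduces CodeRec User Programming States (CUPS), a taxonomy of programmer activities when interacting with Copilot, and provides a labeled user study and metrics for analyzing interaction costs and the effects of interface design.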

ABSTRACT

Code-recommendation systems, such as Copilot and CodeWhisperer, have the potential to improve programmer productivity by suggesting and auto-completing code. However, to fully realize their potential, we must understand how programmers interact with these systems and identify ways to improve that interaction. To seek insights about human-AI collaboration with code recommendations systems, we studied GitHub Copilot, a code-recommendation system used by millions of programmers daily. We developed CUPS, a taxonomy of common programmer activities when interacting with Copilot. Our study of 21 programmers, who completed coding tasks and retrospectively labeled their sessions with CUPS, showed that CUPS can help us understand how programmers interact with code-recommendation systems, revealing inefficiencies and time costs. Our insights reveal how programmers interact with Copilot and motivate new interface designs and metrics.

Motivation and Objectives

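Develop a taxonomy (CUPS) of programmer activities when interacting with code-recommendation systems such as Copilot.
Collect and publicly release a dataset of coding sessions annotated with CUPS labels and video data.
Propose a tool for measuring and analyzing user behavior patterns and time costs in AI-assisted programming.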

Proposed Method

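Developed the 12-state CUPS taxonomy from pilot interactions and segment types (User Typing or Paused, User Before Action).
Collected Copilot telemetry from sessions with 21 developers (3137 labeled segments, 1024 suggestions) and had participants retrospectively label the segments with a dedicated tool.
Designed and used a labeling tool to annotate telemetry segments with CUPS states, then analyzed time spent in each state and transitions between states to quantify interaction costs.
Analyzed acceptance rates and the distribution of time across states, broken down by task and by coder experience.
Provided a public code and data repository for reproducing the labeling and analysis.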
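
The state-time and transition analysis described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the `(session_id, cups_state, duration_seconds)` tuple layout, the sample values, and both function names are assumptions made for the example, and segments are assumed to be listed in time order within each session.

```python
from collections import Counter, defaultdict

# Hypothetical labeled segments: (session_id, cups_state, duration_seconds).
# The layout and the values are illustrative, not the released dataset schema.
segments = [
    ("s1", "Writing New Functionality", 12.0),
    ("s1", "Verifying", 6.0),
    ("s1", "Editing Suggestion", 2.0),
    ("s2", "Prompt Crafting", 5.0),
    ("s2", "Verifying", 5.0),
]


def state_time_fractions(segments):
    """Share of session time per state, averaged over sessions."""
    per_session = defaultdict(lambda: defaultdict(float))
    for session, state, duration in segments:
        per_session[session][state] += duration

    averaged = defaultdict(float)
    n_sessions = len(per_session)
    for states in per_session.values():
        total = sum(states.values())
        for state, seconds in states.items():
            # Normalize within the session first, then average across sessions,
            # so long sessions do not dominate the result.
            averaged[state] += (seconds / total) / n_sessions
    return dict(averaged)


def transition_counts(segments):
    """Count state-to-state transitions between consecutive segments of a session."""
    counts = Counter()
    previous = {}  # session_id -> state of the most recent segment
    for session, state, _ in segments:
        if session in previous and previous[session] != state:
            counts[(previous[session], state)] += 1
        previous[session] = state
    return counts


if __name__ == "__main__":
    print(state_time_fractions(segments))
    print(transition_counts(segments))
```

Averaging per-session fractions, rather than pooling all seconds, is one plausible reading of "average share of session time"; pooling would weight longer sessions more heavily.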

Experimental Results

Research Questions

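RQ1: What activities do programmers perform when interacting with AI-powered code recommendations (Copilot)?
RQ2: How much time do programmers spend in Copilot-related states, and which states dominate the interaction cost?
RQ3: How do acceptance rates and the distribution of CUPS states vary with the programming task and the programmer's level of expertise?
RQ4: Which interface insights or design changes could reduce interaction costs and improve productivity with CodeRec systems in AI-assisted programming?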

Key Findings

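Verifying suggestions is the most time-consuming activity, accounting for 22.4% of participants' session time on average.
Writing new functionality is the second most time-consuming state, at 14.05% on average.
Copilot-related states (Verifying, Deferring Thought, Waiting, Prompt Crafting, Editing Suggestion) together account for 51.4% of average session time.
Acceptance rates vary by task: lower for data manipulation tasks (24.8%) and higher for boilerplate/template code-writing tasks (41.9%).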
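Acceptance rates differ with programming experience and with prior Copilot use: about 30% for more experienced programmers versus about 38% for less experienced ones, and about 38% for participants with prior Copilot experience versus about 29% for those without. Only the prior-Copilot comparison shows a higher rate for the more experienced group.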
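Deferring Thought is common, indicating that users often accept a suggestion in order to inspect it later, or to view it with highlighting, rather than verifying it immediately.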
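
The per-task and per-group acceptance rates reported above can be computed with a short aggregation like the one below. This is a hedged sketch: it assumes the acceptance rate is the number of accepted suggestions divided by the number of suggestions shown, and the `(group, was_accepted)` event layout, the sample events, and the function name are hypothetical rather than taken from the paper's repository.

```python
from collections import defaultdict

# Hypothetical suggestion events: (group, was_accepted).
# "group" could be a task category or an experience bucket; values are illustrative.
events = [
    ("data manipulation", True),
    ("data manipulation", False),
    ("data manipulation", False),
    ("data manipulation", False),
    ("boilerplate", True),
    ("boilerplate", False),
]


def acceptance_rate_by_group(events):
    """Return accepted / shown for each group (assumed definition of acceptance rate)."""
    shown = defaultdict(int)
    accepted = defaultdict(int)
    for group, was_accepted in events:
        shown[group] += 1
        accepted[group] += int(was_accepted)
    return {group: accepted[group] / shown[group] for group in shown}


if __name__ == "__main__":
    for group, rate in acceptance_rate_by_group(events).items():
        print(f"{group}: {rate:.1%}")
```

Swapping the group key from task category to an experience bucket reuses the same function for the experience and prior-use comparisons.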

This review was generated by AI and reviewed by a human editor.