QUICK REVIEW

[Paper Review] AI-assisted Code Authoring at Scale: Fine-tuning, deploying, and mixed methods evaluation

Vijayaraghavan Murali, Chandra Maddila|arXiv (Cornell University)|May 20, 2023

Scheduling and Optimization Algorithms | Cited by 9

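One-sentence summary

This paper presents CodeCompose, an AI-assisted code authoring tool deployed at Meta. It describes the InCoder-based model, fine-tuning on internal code, the system design, deployment across 9 languages, and a mixed-methods evaluation of usage data and user feedback.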

ABSTRACT

Generative LLMs have been shown to effectively power AI-based code authoring tools that can suggest entire statements or blocks of code during code authoring. In this paper we present CodeCompose, an AI-assisted code authoring tool developed and deployed at Meta internally. CodeCompose is based on the InCoder LLM that merges generative capabilities with bi-directionality. We have scaled up CodeCompose to serve tens of thousands of developers at Meta, across 9 programming languages and several coding surfaces. We present our experience in making design decisions about the model and system architecture for CodeCompose that addresses these challenges. To release a LLM model at this scale, we needed to first ensure that it is sufficiently accurate. In a random sample of 20K source code files, depending on the language, we are able to reproduce hidden lines between 40% and 58% of the time, an improvement of 1.4x and 4.1x over a model trained only on public data. We gradually rolled CodeCompose out to developers. At the time of this writing, 16K developers have used it with 8% of their code coming directly from CodeCompose. To triangulate our numerical findings, we conduct a thematic analysis on the feedback from 70 developers. We find that 91.5% of the feedback is positive, with the most common themes being discovering APIs, dealing with boilerplate code, and accelerating coding. Meta continues to integrate this feedback into CodeCompose.

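Research motivation and goals

Show how an enterprise code assistant can be fine-tuned on internal code and deployed at scale.
Examine the system design choices, latency optimizations, and UX considerations behind a large industrial deployment.
Evaluate the effect of CodeCompose on developer productivity and satisfaction using quantitative and qualitative measures.
Identify challenges around trust, accuracy, and integration with existing IDEs in a large organization.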

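Proposed method

Fine-tune an InCoder-based LLM on Meta's internal code using the CM and LCM objectives.
Use a bidirectional, fill-in-the-middle-style training objective (LCM), with metadata such as language, file path, and kernel information attached to the input.
Deploy through a client-server architecture with an LSP-based language server and a Thrift-backed GPU inference tier.
Measure offline performance across languages with exact match and BLEU; collect real-world usage metrics (acceptance rate, percentage of code typed from suggestions) and qualitative user feedback.
Implement latency-focused optimizations (caching, debouncing, minimal batching) and a randomized rollout strategy to reduce bias.
Provide telemetry and support for multiple editors through reusable LSP components and in-house editing surfaces.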
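
The caching and debouncing ideas listed above can be illustrated with a small client-side sketch. This is not CodeCompose's implementation: the class names, the `maybe_suggest` helper, the cache key (text before and after the cursor), and the quiet-period parameter are all assumptions made for illustration.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class SuggestionCache:
    """Small LRU cache keyed on the text around the cursor (hypothetical key choice)."""

    def __init__(self, max_entries: int = 128) -> None:
        self._entries = OrderedDict()
        self._max_entries = max_entries

    def get(self, before: str, after: str) -> Optional[str]:
        key = (before, after)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, before: str, after: str, suggestion: str) -> None:
        key = (before, after)
        self._entries[key] = suggestion
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used


class Debouncer:
    """Permits a request only after the user has paused typing for a while."""

    def __init__(self, quiet_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._quiet_seconds = quiet_seconds
        self._clock = clock
        self._last_keystroke: Optional[float] = None

    def keystroke(self) -> None:
        self._last_keystroke = self._clock()

    def should_request(self) -> bool:
        if self._last_keystroke is None:
            return False
        return self._clock() - self._last_keystroke >= self._quiet_seconds


def maybe_suggest(cache: SuggestionCache, debouncer: Debouncer,
                  before: str, after: str,
                  infer: Callable[[str, str], str]) -> Optional[str]:
    """Return a cached suggestion, or call `infer` once typing has paused."""
    cached = cache.get(before, after)
    if cached is not None:
        return cached
    if not debouncer.should_request():
        return None  # user is still typing; skip the expensive model call
    suggestion = infer(before, after)
    cache.put(before, after, suggestion)
    return suggestion
```

In this sketch the cache is checked before the debouncer, so a context that was already answered is served immediately, while a new context triggers inference only after the quiet period has elapsed.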

Figure 1. CodeCompose (a) offers inline code suggestions in VSCode in grey text while the user is typing code (Tab to accept), (b) changes its suggestion to adapt to a natural language comment, (c) suggests code or documentation based on code below the current position.

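Experimental results

Research questions

RQ1: How does fine-tuning on Meta's internal code improve the accuracy of code suggestions across languages?
RQ2: Which architecture and UX decisions enable scalable, low-latency AI-assisted code completion in a large enterprise?
RQ3: What are CodeCompose's real-world acceptance rate, usage, and user satisfaction?
RQ4: What challenges (trust, hallucination, integration with existing tools) arise for an industrially deployed AI code assistant?
RQ5: How does contextual information (code before and after the cursor, file, kernel) affect model performance?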

Key findings

Language	# Suggestions shown	Acceptance rate (%)	Code typed using CodeCompose (%)	# Users
Python	1.87mn	22	8	10.7k
Hack	1.25mn	22.5	10	5.5k
C++	608.1k	20	10	2.5k
Flow (Javascript)	583.2k	18.2	7	2.5k
Rust	74.2k	17.2	9	212
Objective C++	56.5k	18	6	429
Objective C	34.4k	18.1	6	299
C	23.5k	21.3	12	201
Typescript	8.9k	19	10	76
All	4.5mn	22	8	16k
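
As a sanity check on how the "All" row relates to the per-language rows, the sketch below recomputes the total number of suggestions and a suggestion-weighted acceptance rate from the table. The values are the rounded figures shown above, so the result can only approximate the reported 22%; the weighting scheme itself is an assumption about how the aggregate was formed.

```python
# (suggestions shown, acceptance rate in %) copied from the table above
rows = {
    "Python": (1_870_000, 22.0),
    "Hack": (1_250_000, 22.5),
    "C++": (608_100, 20.0),
    "Flow (Javascript)": (583_200, 18.2),
    "Rust": (74_200, 17.2),
    "Objective C++": (56_500, 18.0),
    "Objective C": (34_400, 18.1),
    "C": (23_500, 21.3),
    "Typescript": (8_900, 19.0),
}


def weighted_acceptance_rate(table):
    """Return (total suggestions shown, acceptance rate weighted by suggestions)."""
    total_shown = sum(shown for shown, _ in table.values())
    total_accepted = sum(shown * rate / 100 for shown, rate in table.values())
    return total_shown, 100 * total_accepted / total_shown


total_shown, overall = weighted_acceptance_rate(rows)
print(f"total shown: {total_shown:,}, weighted acceptance: {overall:.1f}%")
```

The total comes to about 4.5 million suggestions, matching the "All" row, and the weighted rate lands within roughly one percentage point of the reported 22%, a gap that rounding in the per-language figures could plausibly explain.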

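Across all languages, 22% of the suggestions CodeCompose displayed were accepted.
8% of the code typed by developers came from accepted CodeCompose suggestions.
In the qualitative analysis, 91.5% of the user feedback was positive.
Fine-tuning on Meta's internal data substantially improves exact match and BLEU scores for Hack, Python, Flow, and C++; LCM training improves performance further.
The system supports at least 9 languages and sees substantial real-world use, with 4.5 million suggestions shown over 15 days.
UX decisions (single-line suggestions, 300-500 ms latency) and a wave-based rollout proved effective at maintaining trust and usability.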
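
The offline exact-match metric mentioned above can be sketched as follows: hide one line of a file, ask the model to predict it from the surrounding code, and count the prediction as a hit only if it equals the hidden line. The helper names and the whitespace normalization are assumptions for illustration; the paper's evaluation harness is not reproduced here.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str, str]  # (code before, hidden line, code after)


def hide_line(source: str, index: int) -> Example:
    """Split a file into the code before, the hidden line, and the code after."""
    lines = source.splitlines()
    before = "\n".join(lines[:index])
    hidden = lines[index]
    after = "\n".join(lines[index + 1:])
    return before, hidden, after


def exact_match_rate(examples: List[Example],
                     predict: Callable[[str, str], str]) -> float:
    """Fraction of hidden lines reproduced exactly (ignoring outer whitespace)."""
    if not examples:
        return 0.0
    hits = 0
    for before, hidden, after in examples:
        # Assumption: leading/trailing whitespace is not counted as a mismatch.
        if predict(before, after).strip() == hidden.strip():
            hits += 1
    return hits / len(examples)
```

Here `predict` stands in for a call to the model; any function that takes the code before and after the gap and returns a candidate line can be plugged in.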

Figure 2. Steps to construct an input to the model in LCM with an example.
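
Since the figure itself is not reproduced here, the following is a rough sketch of what building a fill-in-the-middle style input with metadata might look like. The `<mask>` sentinel string, the comment-style metadata header, and the function name are hypothetical; the exact format used by CodeCompose may differ.

```python
from typing import Tuple

MASK = "<mask>"  # hypothetical sentinel; the real token is an assumption


def build_lcm_example(source: str, start: int, end: int,
                      language: str, path: str) -> Tuple[str, str]:
    """Mask source[start:end] and return (model input, target span)."""
    before, target, after = source[:start], source[start:end], source[end:]
    # Hypothetical metadata header carrying the language and file path.
    metadata = f"# language: {language}\n# path: {path}\n"
    # The masked span is replaced by a sentinel; a second sentinel at the end
    # marks where the model is expected to generate the target.
    model_input = metadata + before + MASK + after + MASK
    return model_input, target
```

Because the code after the cursor appears before the final sentinel, a left-to-right model can condition on both sides of the gap when generating the target.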


This review was generated by AI and reviewed by human editors.