QUICK REVIEW

[Paper Review] AI-assisted Code Authoring at Scale: Fine-tuning, deploying, and mixed methods evaluation

Vijayaraghavan Murali, Chandra Maddila|arXiv (Cornell University)|May 20, 2023

Scheduling and Optimization Algorithms | Cited by 9

One-Sentence Summary

The paper presents CodeCompose, an AI-assisted code authoring tool deployed at Meta, detailing its model (InCoder-based), fine-tuning on internal code, system design, deployment across 9 languages, and multi-faceted evaluation of usage and user feedback.

ABSTRACT

Generative LLMs have been shown to effectively power AI-based code authoring tools that can suggest entire statements or blocks of code during code authoring. In this paper we present CodeCompose, an AI-assisted code authoring tool developed and deployed at Meta internally. CodeCompose is based on the InCoder LLM that merges generative capabilities with bi-directionality. We have scaled up CodeCompose to serve tens of thousands of developers at Meta, across 9 programming languages and several coding surfaces. We present our experience in making design decisions about the model and system architecture for CodeCompose that addresses these challenges. To release a LLM model at this scale, we needed to first ensure that it is sufficiently accurate. In a random sample of 20K source code files, depending on the language, we are able to reproduce hidden lines between 40% and 58% of the time, an improvement of 1.4x and 4.1x over a model trained only on public data. We gradually rolled CodeCompose out to developers. At the time of this writing, 16K developers have used it with 8% of their code coming directly from CodeCompose. To triangulate our numerical findings, we conduct a thematic analysis on the feedback from 70 developers. We find that 91.5% of the feedback is positive, with the most common themes being discovering APIs, dealing with boilerplate code, and accelerating coding. Meta continues to integrate this feedback into CodeCompose.

Research Motivation and Objectives

Demonstrate how an enterprise code assistant can be fine-tuned on internal code and deployed at scale.
Explore system design choices, latency optimizations, and UX considerations for large-scale industrial deployment.
Assess the impact of CodeCompose on developer productivity and satisfaction through quantitative and qualitative metrics.
Identify challenges around trust, accuracy, and integration with existing IDEs in a large organization.

Proposed Method

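Fine-tune an InCoder-based LLM on Meta's internal code using the Causal Masking (CM) and Language Causal Masking (LCM) objectives.
LCM is a bidirectional, fill-in-the-middle style training objective that prepends metadata such as the programming language, file path, and notebook kernel to each input.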
Deploy via a client-server architecture with an LSP-based language server and Thrift-backed GPU inference tier.
Measure performance with offline exact-match and BLEU metrics across languages; collect real-world usage metrics (acceptance rate, percentage of code typed from suggestions) and qualitative user feedback.
Implement latency-focused optimizations (caching, debouncing, minimal batching) and a randomized rollout strategy to reduce bias.
Provide telemetry and support for multiple editors through a reusable LSP component and in-house editor surfaces.
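The caching and debouncing optimizations listed above can be illustrated with a minimal sketch. It is not CodeCompose's implementation: the class name `DebouncedCompleter`, the `fetch` callback standing in for the remote inference call, and the 20 ms quiet period are all assumptions made for illustration. The sketch replays timestamped keystrokes, sends a request only when typing pauses, and reuses a cached suggestion when the surrounding code is unchanged.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Context = Tuple[str, str]  # (code before cursor, code after cursor)


@dataclass
class DebouncedCompleter:
    """Hypothetical client-side wrapper; names and defaults are assumptions."""

    fetch: Callable[[str, str], str]  # stand-in for the remote inference call
    debounce_s: float = 0.02          # illustrative quiet period
    cache: Dict[Context, str] = field(default_factory=dict)
    backend_calls: int = 0

    def replay(self, keystrokes: List[Tuple[float, str, str]]) -> List[str]:
        """Replay (timestamp, prefix, suffix) events; return suggestions shown."""
        shown: List[str] = []
        for i, (t, prefix, suffix) in enumerate(keystrokes):
            is_last = i + 1 == len(keystrokes)
            # Debounce: drop this event if another keystroke follows too soon.
            if not is_last and keystrokes[i + 1][0] - t < self.debounce_s:
                continue
            key = (prefix, suffix)
            # Cache: only call the backend for contexts not seen before.
            if key not in self.cache:
                self.cache[key] = self.fetch(prefix, suffix)
                self.backend_calls += 1
            shown.append(self.cache[key])
        return shown


def fake_model(prefix: str, suffix: str) -> str:
    """Toy model used only to make the sketch runnable."""
    return prefix.upper()


completer = DebouncedCompleter(fetch=fake_model)
events = [(0.000, "d", ""), (0.005, "de", ""), (0.010, "def", ""), (0.500, "def", "")]
suggestions = completer.replay(events)
print(suggestions, completer.backend_calls)
```

In this replay the first two keystrokes are dropped by the debounce check, the third triggers one backend call, and the fourth is served from the cache, so four keystrokes cost a single inference request.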

Figure 1. CodeCompose (a) offers inline code suggestions in VSCode in grey text while the user is typing code (Tab to accept), (b) changes its suggestion to adapt to a natural language comment, (c) suggests code or documentation based on code below the current position.

Experimental Results

Research Questions

RQ1: How does fine-tuning on Meta's internal code improve code suggestion accuracy across multiple languages?
RQ2: What architectural and UX decisions enable scalable, low-latency AI-assisted code completion at a large company?
RQ3: What is the real-world impact of CodeCompose in terms of acceptance rates, usage, and user satisfaction?
RQ4: What challenges (trust, hallucinations, integration with existing tools) arise in industrial deployments of AI code assistants?
RQ5: How does contextual information (code before/after cursor, file, kernel) influence model performance?

Key Findings

Language	Suggestions shown	Acceptance rate (%)	Code typed using CodeCompose (%)	Users
Python	1.87M	22	8	10.7k
Hack	1.25M	22.5	10	5.5k
C++	608.1k	20	10	2.5k
Flow (JavaScript)	583.2k	18.2	7	2.5k
Rust	74.2k	17.2	9	212
Objective-C++	56.5k	18	6	429
Objective-C	34.4k	18.1	6	299
C	23.5k	21.3	12	201
TypeScript	8.9k	19	10	76
All	4.5M	22	8	16k

CodeCompose achieved a 22% acceptance rate across languages for displayed suggestions.
8% of the code typed by developers came from accepted CodeCompose suggestions.
Qualitative feedback showed 91.5% favorable reception among surveyed users.
Fine-tuning on Meta's internal data significantly improved exact-match and BLEU scores across Hack, Python, Flow, and C++; LCM training further boosted performance.
The system supports 9 languages and displayed 4.5 million suggestions over a 15-day period.
UX decisions (single-line suggestions, 300-500ms latency) and a wave-based rollout were effective in maintaining trust and usability.
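The two usage metrics reported above can be made concrete with a short sketch. The definitions used here are assumptions: acceptance rate is taken as accepted suggestions divided by displayed suggestions, and the percentage of code typed using CodeCompose as accepted suggestion characters divided by all characters entered (accepted plus manually typed). The `SuggestionEvent` record and the sample numbers are hypothetical and are not data from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SuggestionEvent:
    """Hypothetical telemetry record for one displayed suggestion."""

    accepted: bool
    suggestion_chars: int


def acceptance_rate(events: Iterable[SuggestionEvent]) -> float:
    """Percentage of displayed suggestions that the developer accepted."""
    shown: List[SuggestionEvent] = list(events)
    if not shown:
        return 0.0
    return 100.0 * sum(e.accepted for e in shown) / len(shown)


def pct_code_from_suggestions(
    events: Iterable[SuggestionEvent], manually_typed_chars: int
) -> float:
    """Percentage of entered characters that came from accepted suggestions."""
    accepted_chars = sum(e.suggestion_chars for e in events if e.accepted)
    total_chars = accepted_chars + manually_typed_chars
    return 100.0 * accepted_chars / total_chars if total_chars else 0.0


# Made-up session: 5 suggestions shown, 2 accepted, 540 characters typed by hand.
session = [
    SuggestionEvent(True, 40),
    SuggestionEvent(False, 25),
    SuggestionEvent(False, 30),
    SuggestionEvent(True, 20),
    SuggestionEvent(False, 10),
]
print(acceptance_rate(session))                 # 40.0
print(pct_code_from_suggestions(session, 540))  # 10.0
```

Under these definitions the two numbers can diverge widely: many short accepted suggestions raise the acceptance rate without contributing much to the share of code, which may be why the review reports both.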

Figure 2. Steps to construct an input to the model in LCM with an example.
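Since the figure itself is not reproduced here, the following is a rough sketch of how a fill-in-the-middle style input of this kind might be assembled. It assumes a layout of metadata, code before the cursor, a mask sentinel, code after the cursor, a second mask sentinel, and then the hidden target. The sentinel string `<mask>`, the metadata encoding, and the function name `build_lcm_example` are illustrative placeholders and are not taken from the paper.

```python
from typing import Dict

MASK = "<mask>"  # hypothetical sentinel string; the real token may differ


def build_lcm_example(
    source: str, start: int, end: int, language: str, path: str
) -> Dict[str, str]:
    """Hide source[start:end] and build a fill-in-the-middle style input."""
    if not 0 <= start <= end <= len(source):
        raise ValueError("span must lie inside the source text")
    before, target, after = source[:start], source[start:end], source[end:]
    # Hypothetical metadata header carrying the language and file path.
    metadata = f"<lang:{language}> <path:{path}>\n"
    # The model sees code on both sides of the gap, then predicts the target.
    prompt = metadata + before + MASK + after + MASK
    return {"prompt": prompt, "target": target, "training_text": prompt + target}


source = "def add(a, b):\n    return a + b\n"
start = source.index("return")
end = start + len("return a + b")
example = build_lcm_example(source, start, end, "python", "util/math.py")
print(example["prompt"])
print(example["target"])
```

At training time the full `training_text` would be used so the model learns to emit the target after the second sentinel; at inference time only `prompt` would be sent, with the cursor position defining the gap.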

This review was generated by AI and checked by a human editor.