[Paper Review] AI-assisted Code Authoring at Scale: Fine-tuning, deploying, and mixed methods evaluation
The paper presents CodeCompose, an AI-assisted code authoring tool deployed at Meta, detailing its model (InCoder-based), fine-tuning on internal code, system design, deployment across 9 languages, and multi-faceted evaluation of usage and user feedback.
Generative LLMs have been shown to effectively power AI-based code authoring tools that can suggest entire statements or blocks of code during code authoring. In this paper we present CodeCompose, an AI-assisted code authoring tool developed and deployed at Meta internally. CodeCompose is based on the InCoder LLM that merges generative capabilities with bi-directionality. We have scaled up CodeCompose to serve tens of thousands of developers at Meta, across 9 programming languages and several coding surfaces. We present our experience in making design decisions about the model and system architecture for CodeCompose that address the challenges of deploying at this scale. To release an LLM at this scale, we needed to first ensure that it is sufficiently accurate. In a random sample of 20K source code files, depending on the language, we are able to reproduce hidden lines between 40% and 58% of the time, an improvement of between 1.4x and 4.1x over a model trained only on public data. We gradually rolled CodeCompose out to developers. At the time of this writing, 16K developers have used it with 8% of their code coming directly from CodeCompose. To triangulate our numerical findings, we conduct a thematic analysis on the feedback from 70 developers. We find that 91.5% of the feedback is positive, with the most common themes being discovering APIs, dealing with boilerplate code, and accelerating coding. Meta continues to integrate this feedback into CodeCompose.
Research Motivation and Objectives
- Demonstrate how an enterprise code assistant can be fine-tuned on internal code and deployed at scale.
- Explore system design choices, latency optimizations, and UX considerations for large-scale industrial deployment.
- Assess the impact of CodeCompose on developer productivity and satisfaction through quantitative and qualitative metrics.
- Identify challenges around trust, accuracy, and integration with existing IDEs in a large organization.
Proposed Method
- Fine-tune an InCoder-based LLM on Meta's internal code using the Causal Masking (CM) and Language Causal Masking (LCM) objectives.
- Use a bidirectional, fill-in-the-middle style training objective (LCM) conditioned on metadata such as language, file path, and kernel.
- Deploy via a client-server architecture with an LSP-based language server and Thrift-backed GPU inference tier.
- Measure performance with offline exact-match and BLEU metrics across languages; collect real-world usage metrics (acceptance rate, percentage of code typed from suggestions) and qualitative user feedback.
- Implement latency-focused optimizations (caching, debouncing, minimal batching) and a randomized rollout strategy to reduce bias.
- Provide telemetry and support for multiple editors through a reusable LSP component and in-house editor surfaces.
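
The fill-in-the-middle idea behind the LCM objective can be illustrated with a minimal sketch. It assumes that a span of whole lines is hidden, that the remaining code is joined around a sentinel, and that the hidden span becomes the prediction target. The `<mask>` sentinel, the metadata comment format, and the function names are hypothetical choices for illustration, not the tokens or layout used by CodeCompose.

```python
import random

MASK = "<mask>"  # hypothetical sentinel; the real vocabulary may differ


def build_lcm_example(lines, start, end, language, file_path):
    """Hide lines[start:end] and return (prompt, target).

    The prompt holds metadata, the code before the span, a mask, the code
    after the span, and a second mask that marks where generation begins.
    """
    if not 0 <= start < end <= len(lines):
        raise ValueError("masked span must be a non-empty range inside the file")
    before = "".join(lines[:start])
    target = "".join(lines[start:end])
    after = "".join(lines[end:])
    # Assumed metadata layout: plain comment lines at the top of the prompt.
    metadata = f"# language: {language}\n# path: {file_path}\n"
    prompt = metadata + before + MASK + after + MASK
    return prompt, target


def sample_line_span(num_lines, rng):
    """Pick a random non-empty span aligned to line boundaries."""
    start = rng.randrange(num_lines)
    end = rng.randrange(start + 1, num_lines + 1)
    return start, end


source = "def add(a, b):\n    total = a + b\n    return total\n"
lines = source.splitlines(keepends=True)
prompt, target = build_lcm_example(lines, 1, 2, "python", "math/add.py")
print(prompt)
print("target:", repr(target))
```

Masking on line boundaries keeps the target aligned with what a single-line suggestion would have to produce, and placing the code after the cursor inside the prompt is what gives the model its bidirectional context.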

Experimental Results
Research Questions
- RQ1: How does fine-tuning on Meta's internal code improve code suggestion accuracy across multiple languages?
- RQ2: What architectural and UX decisions enable scalable, low-latency AI-assisted code completion at a large company?
- RQ3: What is the real-world impact of CodeCompose in terms of acceptance rates, usage, and user satisfaction?
- RQ4: What challenges (trust, hallucinations, integration with existing tools) arise in industrial deployments of AI code assistants?
- RQ5: How does contextual information (code before/after cursor, file, kernel) influence model performance?
Key Findings
| Language | # Suggestions shown | Acceptance rate (%) | Code typed using CodeCompose (%) | # Users |
|---|---|---|---|---|
| Python | 1.87M | 22 | 8 | 10.7k |
| Hack | 1.25M | 22.5 | 10 | 5.5k |
| C++ | 608.1k | 20 | 10 | 2.5k |
| Flow (JavaScript) | 583.2k | 18.2 | 7 | 2.5k |
| Rust | 74.2k | 17.2 | 9 | 212 |
| Objective C++ | 56.5k | 18 | 6 | 429 |
| Objective C | 34.4k | 18.1 | 6 | 299 |
| C | 23.5k | 21.3 | 12 | 201 |
| TypeScript | 8.9k | 19 | 10 | 76 |
| All | 4.5M | 22 | 8 | 16k |
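
The two usage columns in the table can be read as simple ratios over telemetry events. The sketch below is one plausible way to compute them, assuming a character-based definition for the share of code coming from suggestions. The `SuggestionEvent` record, its fields, and the sample numbers are hypothetical and are not taken from the paper's data.

```python
from dataclasses import dataclass


@dataclass
class SuggestionEvent:
    shown_chars: int  # length of the suggestion that was displayed
    accepted: bool    # whether the developer accepted it


def acceptance_rate(events):
    """Accepted suggestions as a percentage of suggestions shown."""
    if not events:
        return 0.0
    accepted = sum(1 for e in events if e.accepted)
    return 100.0 * accepted / len(events)


def percent_code_from_suggestions(events, manually_typed_chars):
    """Characters from accepted suggestions as a percentage of all code written.

    Assumes total code = accepted suggestion characters + manually typed ones.
    """
    accepted_chars = sum(e.shown_chars for e in events if e.accepted)
    total = accepted_chars + manually_typed_chars
    return 100.0 * accepted_chars / total if total else 0.0


# Made-up events for illustration only.
events = [
    SuggestionEvent(40, True),
    SuggestionEvent(25, False),
    SuggestionEvent(30, False),
    SuggestionEvent(20, True),
    SuggestionEvent(35, False),
]
print(acceptance_rate(events))                     # 2 of 5 accepted
print(percent_code_from_suggestions(events, 540))  # 60 of 600 characters
```

The two numbers answer different questions: acceptance rate measures how often a shown suggestion is useful, while the character share measures how much of the final code the tool actually wrote.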
- CodeCompose achieved a 22% acceptance rate across languages for displayed suggestions.
- 8% of the code typed by developers came from accepted CodeCompose suggestions.
- Qualitative feedback showed 91.5% favorable reception among surveyed users.
- Fine-tuning on Meta's internal data significantly improved exact-match and BLEU scores across Hack, Python, Flow, and C++; LCM training further boosted performance.
- The system supports at least 9 languages and shows substantial real-world usage with 4.5 million suggestions over 15 days.
- UX decisions (single-line suggestions, 300-500 ms latency) and a wave-based rollout were effective in maintaining trust and usability.
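
The offline exact-match metric mentioned in these findings can be sketched as follows: hide a line, ask the model for it, and count how often the prediction reproduces the original. The whitespace normalization and the function names here are assumptions made for illustration; the paper's evaluation may compare strings differently.

```python
def normalize(line):
    """Collapse runs of whitespace so formatting differences do not count as misses."""
    return " ".join(line.split())


def exact_match_rate(predictions, references):
    """Percentage of predictions that equal the hidden reference line."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Made-up predictions and hidden lines for illustration only.
references = ["return total", "x = load(path)", "for item in items:", "count += 1"]
predictions = ["return  total", "x = read(path)", "for item in items:", "count = 1"]
print(exact_match_rate(predictions, references))
```

Exact match is a strict measure because a suggestion that differs by one token counts as a miss, which is why the review also reports BLEU as a softer similarity score.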

This review was generated by AI and checked by a human editor.