QUICK REVIEW

[Paper Review] AI-assisted Code Authoring at Scale: Fine-tuning, deploying, and mixed methods evaluation

Vijayaraghavan Murali, Chandra Maddila | arXiv (Cornell University) | May 20, 2023
Scheduling and Optimization Algorithms | Citations: 9
One-Line Summary

The paper presents CodeCompose, an AI-assisted code authoring tool deployed at Meta, detailing its model (InCoder-based), fine-tuning on internal code, system design, deployment across 9 languages, and multi-faceted evaluation of usage and user feedback.

ABSTRACT

Generative LLMs have been shown to effectively power AI-based code authoring tools that can suggest entire statements or blocks of code during code authoring. In this paper we present CodeCompose, an AI-assisted code authoring tool developed and deployed internally at Meta. CodeCompose is based on the InCoder LLM, which merges generative capabilities with bi-directionality. We have scaled up CodeCompose to serve tens of thousands of developers at Meta, across 9 programming languages and several coding surfaces. We present our experience in making design decisions about the model and system architecture for CodeCompose that address the challenges of deploying at this scale. To release an LLM at this scale, we first needed to ensure that it is sufficiently accurate. In a random sample of 20K source code files, depending on the language, we are able to reproduce hidden lines between 40% and 58% of the time, an improvement of between 1.4x and 4.1x over a model trained only on public data. We gradually rolled CodeCompose out to developers. At the time of this writing, 16K developers have used it, with 8% of their code coming directly from CodeCompose. To triangulate our numerical findings, we conducted a thematic analysis of the feedback from 70 developers. We find that 91.5% of the feedback is positive, with the most common themes being discovering APIs, dealing with boilerplate code, and accelerating coding. Meta continues to integrate this feedback into CodeCompose.
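The hidden-line accuracy figure above can be pictured with the sketch below. It is a minimal illustration, not the paper's evaluation harness: `complete(prefix, suffix)` is a hypothetical stand-in for the model, one non-blank line is hidden per file, and a "match" means equality after stripping surrounding whitespace. The paper's sampling and normalization rules may differ.

```python
import random
from typing import Callable, List

# Hypothetical model interface: (code before hidden line, code after) -> predicted line.
Completer = Callable[[str, str], str]


def exact_match_rate(files: List[str], complete: Completer, seed: int = 0) -> float:
    """Hide one random non-blank line per file; return the fraction reproduced exactly."""
    rng = random.Random(seed)
    hits = 0
    total = 0
    for source in files:
        lines = source.splitlines()
        candidates = [i for i, line in enumerate(lines) if line.strip()]
        if not candidates:
            continue  # nothing to hide in an empty or all-blank file
        i = rng.choice(candidates)
        prefix = "\n".join(lines[:i])
        suffix = "\n".join(lines[i + 1:])
        prediction = complete(prefix, suffix)
        # Count a hit only if the prediction equals the hidden line (whitespace-stripped).
        hits += int(prediction.strip() == lines[i].strip())
        total += 1
    return hits / total if total else 0.0


def always_x(prefix: str, suffix: str) -> str:
    """Toy completer that ignores its context."""
    return "x = 1"


if __name__ == "__main__":
    corpus = ["x = 1", "def f():\n    return 0"]
    print(exact_match_rate(corpus, always_x))  # prints 0.5
```

Because the model sees both the prefix and the suffix, this setup exercises the bidirectional (fill-in-the-middle) ability that the abstract attributes to InCoder.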

Motivation and Objectives

  • Demonstrate how an enterprise code assistant can be fine-tuned on internal code and deployed at scale.
  • Explore system design choices, latency optimizations, and UX considerations for large-scale industrial deployment.
  • Assess the impact of CodeCompose on developer productivity and satisfaction through quantitative and qualitative metrics.
  • Identify challenges around trust, accuracy, and integration with existing IDEs in a large organization.

Proposed Method

  • Fine-tune an InCoder-based LLM on Meta's internal code using the causal masking (CM) and language causal masking (LCM) objectives.
  • Use a bidirectional, fill-in-the-middle style training objective (LCM) with metadata like language, file path, and kernel.
  • Deploy via a client-server architecture with an LSP-based language server and a Thrift-backed GPU inference tier.
  • Measure performance with offline exact-match and BLEU metrics across languages; collect real-world usage metrics (acceptance rate, percentage of code typed from suggestions) and qualitative user feedback.
  • Implement latency-focused optimizations (caching, debouncing, minimal batching) and a randomized rollout strategy to reduce bias.
  • Provide telemetry and support for multiple editors through a reusable LSP component and in-house editor surfaces.
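The caching and debouncing optimizations can be illustrated with a small client-side sketch. It assumes a hypothetical async `fetch(prefix, suffix)` call to the inference tier and an assumed debounce window (`delay_s`); it is not CodeCompose's actual client code. A generation counter drops requests superseded by a newer keystroke, and an in-memory dictionary serves a repeated context without a server round trip.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple

# Hypothetical async call to the inference tier: (prefix, suffix) -> suggestion.
Fetch = Callable[[str, str], Awaitable[str]]


class DebouncedCompleter:
    """Client-side sketch: debounce keystrokes and cache suggestions by context."""

    def __init__(self, fetch: Fetch, delay_s: float = 0.05) -> None:
        self._fetch = fetch
        self._delay_s = delay_s  # assumed debounce window, in seconds
        self._cache: Dict[Tuple[str, str], str] = {}
        self._generation = 0  # bumped on every uncached keystroke
        self.server_calls = 0

    async def on_keystroke(self, prefix: str, suffix: str) -> Optional[str]:
        key = (prefix, suffix)
        if key in self._cache:
            return self._cache[key]  # cache hit: no delay, no server call
        self._generation += 1
        my_generation = self._generation
        await asyncio.sleep(self._delay_s)  # wait out the debounce window
        if my_generation != self._generation:
            return None  # a newer keystroke superseded this request
        self.server_calls += 1
        suggestion = await self._fetch(prefix, suffix)
        self._cache[key] = suggestion
        return suggestion


async def demo() -> Tuple[List[Optional[str]], Optional[str], int]:
    async def fake_fetch(prefix: str, suffix: str) -> str:
        return prefix + "('hello')"

    completer = DebouncedCompleter(fake_fetch, delay_s=0.01)
    # Three rapid keystrokes: only the last one should reach the server.
    burst = await asyncio.gather(
        completer.on_keystroke("pri", ""),
        completer.on_keystroke("prin", ""),
        completer.on_keystroke("print", ""),
    )
    # Revisiting the same context is served from the cache.
    cached = await completer.on_keystroke("print", "")
    return list(burst), cached, completer.server_calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A generation counter is used instead of task cancellation to keep the logic simple: stale requests still wake up, but they return `None` rather than calling the server.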
Figure 1. CodeCompose (a) offers inline code suggestions in grey text in VSCode as the user types (Tab to accept), (b) adapts its suggestion to a natural-language comment, and (c) suggests code or documentation based on the code below the current position.

Experimental Results

Research Questions

  • RQ1: How does fine-tuning on Meta's internal code improve code suggestion accuracy across multiple languages?
  • RQ2: What architectural and UX decisions enable scalable, low-latency AI-assisted code completion at a large company?
  • RQ3: What is the real-world impact of CodeCompose in terms of acceptance rates, usage, and user satisfaction?
  • RQ4: What challenges (trust, hallucinations, integration with existing tools) arise in industrial deployments of AI code assistants?
  • RQ5: How does contextual information (code before/after the cursor, file, kernel) influence model performance?

Key Findings

Language | Suggestions shown | Acceptance rate (%) | Share of code typed using CodeCompose (%) | Users
Python | 1.87M | 22 | 8 | 10.7k
Hack | 1.25M | 22.5 | 10 | 5.5k
C++ | 608.1k | 20 | 10 | 2.5k
Flow (JavaScript) | 583.2k | 18.2 | 7 | 2.5k
Rust | 74.2k | 17.2 | 9 | 212
Objective-C++ | 56.5k | 18 | 6 | 429
Objective-C | 34.4k | 18.1 | 6 | 299
C | 23.5k | 21.3 | 12 | 201
TypeScript | 8.9k | 19 | 10 | 76
All | 4.5M | 22 | 8 | 16k
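The two usage metrics in the table can be computed from suggestion telemetry roughly as follows. The `SuggestionEvent` record is hypothetical, and the character-based definition of "share of code typed using CodeCompose" is an assumption for illustration; the paper's instrumentation may count differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SuggestionEvent:
    """One suggestion displayed to a developer (hypothetical telemetry record)."""
    text: str
    accepted: bool


def acceptance_rate(events: Sequence[SuggestionEvent]) -> float:
    """Accepted suggestions divided by suggestions shown."""
    if not events:
        return 0.0
    return sum(e.accepted for e in events) / len(events)


def share_of_code_from_suggestions(
    events: Sequence[SuggestionEvent], manually_typed_chars: int
) -> float:
    """Characters inserted by accepted suggestions over all characters added (assumed definition)."""
    accepted_chars = sum(len(e.text) for e in events if e.accepted)
    total_chars = accepted_chars + manually_typed_chars
    return accepted_chars / total_chars if total_chars else 0.0


if __name__ == "__main__":
    events = [SuggestionEvent("return x", True)] + [SuggestionEvent("pass", False)] * 4
    print(acceptance_rate(events))  # 1 of 5 shown -> 0.2
    print(share_of_code_from_suggestions(events, manually_typed_chars=92))  # 8 / 100 -> 0.08
```

The two numbers measure different things: a high acceptance rate on short suggestions can still yield a small share of code, which is why the table reports both.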
  • CodeCompose achieved a 22% acceptance rate across languages for displayed suggestions.
  • 8% of the code typed by developers came from accepted CodeCompose suggestions.
  • A thematic analysis of feedback from 70 developers found that 91.5% of it was positive.
  • Fine-tuning on Meta's internal data improved exact-match and BLEU scores across Hack, Python, Flow, and C++ (hidden lines were reproduced 40% to 58% of the time depending on language, 1.4x to 4.1x better than a model trained only on public data); LCM training further boosted performance.
  • The system supports 9 languages and saw substantial real-world usage, with 4.5 million suggestions shown over 15 days.
  • UX decisions (single-line suggestions, 300-500ms latency) and a wave-based rollout were effective in maintaining trust and usability.
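A randomized, wave-based rollout can be sketched with deterministic hash bucketing. This is a common technique swapped in here for illustration; the salt string and the percentage-based waves are assumptions, not details from the paper. Hashing a stable user ID keeps assignment pseudo-random with respect to attributes such as team, and a developer enrolled in an earlier wave stays enrolled as the percentage grows.

```python
import hashlib

ROLLOUT_SALT = "codecompose-rollout"  # hypothetical salt; changing it reshuffles buckets


def rollout_bucket(user_id: str, salt: str = ROLLOUT_SALT) -> int:
    """Map a stable user ID to a pseudo-random bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enrolled(user_id: str, wave_percent: int) -> bool:
    """A user is in the current wave if their bucket falls below the wave's percentage."""
    return rollout_bucket(user_id) < wave_percent


if __name__ == "__main__":
    users = [f"user{i}" for i in range(10_000)]
    for wave in (5, 25, 50, 100):
        enrolled = sum(is_enrolled(u, wave) for u in users)
        print(f"wave {wave}%: {enrolled} users enrolled")
```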
Figure 2. Steps to construct an input to the model in LCM, with an example.
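One way to picture the LCM input construction in Figure 2 is the sketch below. It relies only on this summary (a cursor inside a line, code before and after the cursor, metadata such as language and file path) and on InCoder-style infilling. The sentinel spellings, the metadata header format, and the choice to mask from a random cursor to the end of the line are assumptions, not the paper's exact recipe.

```python
import random
from dataclasses import dataclass

# InCoder-style sentinels written as plain strings (assumed spelling; real
# training would use the tokenizer's special tokens).
MASK = "<|mask:0|>"
END_OF_MASK = "<|endofmask|>"


@dataclass(frozen=True)
class LcmExample:
    header: str  # hypothetical metadata header
    prefix: str  # code before the cursor
    suffix: str  # code after the masked span
    target: str  # rest of the line the model must produce

    def model_input(self) -> str:
        return f"{self.header}{self.prefix}{MASK}{self.suffix}{MASK}"

    def training_text(self) -> str:
        return self.model_input() + self.target + END_OF_MASK


def build_lcm_example(code: str, language: str, path: str, rng: random.Random) -> LcmExample:
    lines = code.split("\n")
    candidates = [i for i, line in enumerate(lines) if line.strip()]
    if not candidates:
        raise ValueError("code has no non-blank line to mask")
    i = rng.choice(candidates)
    line = lines[i]
    indent = len(line) - len(line.lstrip())
    # Place the cursor at a random offset after the indentation so the target is non-empty.
    cursor = rng.randrange(indent, len(line))
    prefix = "\n".join(lines[:i] + [line[:cursor]])
    suffix = "".join("\n" + rest for rest in lines[i + 1:])
    header = f"# language: {language}\n# path: {path}\n"
    return LcmExample(header, prefix, suffix, line[cursor:])


if __name__ == "__main__":
    source = "def add(a, b):\n    return a + b\n"
    example = build_lcm_example(source, "python", "example/math_utils.py", random.Random(7))
    print(example.training_text())
```

By construction, `prefix + target + suffix` reproduces the original file, so the model is trained to finish the current line given code on both sides of the cursor, which matches how suggestions are requested in the editor.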


This review was generated by AI and checked by a human editor.