QUICK REVIEW

[Paper Review] A Survey on Large Language Models for Code Generation

J.-H.R. Jiang, Fan Wang|arXiv (Cornell University)|Jun 1, 2024

Natural Language Processing Techniques|Cited by 54

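One-line summary

A comprehensive, systematic review of how large language models are used for NL-to-code generation: it introduces a taxonomy, surveys data, training, evaluation, and practical applications, and provides empirical reference points on the HumanEval and MBPP benchmarks.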

ABSTRACT

Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks, known as Code LLMs, particularly in code generation that generates source code with LLM from natural language descriptions. This burgeoning field has captured significant interest from both academic researchers and industry professionals due to its practical significance in software development, e.g., GitHub Copilot. Despite the active exploration of LLMs for a variety of code tasks, either from the perspective of natural language processing (NLP) or software engineering (SE) or both, there is a noticeable absence of a comprehensive and up-to-date literature review dedicated to LLM for code generation. In this survey, we aim to bridge this gap by providing a systematic literature review that serves as a valuable reference for researchers investigating the cutting-edge progress in LLMs for code generation. We introduce a taxonomy to categorize and discuss the recent developments in LLMs for code generation, covering aspects such as data curation, latest advances, performance evaluation, ethical implications, environmental impact, and real-world applications. In addition, we present a historical overview of the evolution of LLMs for code generation and offer an empirical comparison using the HumanEval, MBPP, and BigCodeBench benchmarks across various levels of difficulty and types of programming tasks to highlight the progressive enhancements in LLM capabilities for code generation. We identify critical challenges and promising opportunities regarding the gap between academia and practical development. Furthermore, we have established a dedicated resource GitHub page (https://github.com/juyongjiang/CodeLLMSurvey) to continuously document and disseminate the most recent advances in the field.

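Motivation and objectives

Define code generation with LLMs as an NL-to-code task and justify its importance.
Present a taxonomy that organizes data, training, evaluation, and application for code generation.
Synthesize developments in Code LLMs, including data curation, pre-training, fine-tuning, and prompting.
Critically discuss the challenges and opportunities in connecting research with practice.
Provide empirical context on the HumanEval and MBPP benchmarks to show progress.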

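Proposed approach

Propose a taxonomy that classifies data curation, training, evaluation, and application for code generation.
Survey data sources, benchmarks, and model families across pre-training, instruction tuning, and fine-tuning.
Summarize advanced topics such as retrieval augmentation, autonomous coding agents, and LLM-as-a-judge.
Discuss evaluation metrics and practical applications in industrial tools such as Copilot and CodeWhisperer.
Offer an empirical perspective by referencing progress on well-known benchmarks (HumanEval, MBPP).
Provide a resource page (codellm.github.io) for continuous updates.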

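Experimental results

Research questions

RQ1: What are the core components and stages (data, training, evaluation, deployment) that define LLMs for code generation?
RQ2: How have data curation, model architectures, and fine-tuning strategies evolved to improve NL-to-code capability?
RQ3: Which benchmarks and evaluation methods best reflect real-world code generation performance?
RQ4: What challenges hinder the translation of research into practice, and how can they be addressed?
RQ5: What are the most promising opportunities and future directions for code-centric LLMs?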

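Key findings

Code LLMs have made notable progress in code generation, with capabilities such as instruction following and in-context learning emerging as model scale increases.
The paper cites progress on HumanEval, where Pass@1 rises from 3.6% (PaLM 8B) to 95.1% (LDB), a dramatic gain on a standard NL-to-code task.
A broad ecosystem of specialized code models, such as StarCoder, Code LLaMA, and CodeGemma, complements general-purpose LLMs for code tasks.
The taxonomy is broad, covering data curation, pre-training, instruction tuning, evaluation, prompting, retrieval-augmented generation, and autonomous agents as integral aspects of the code generation pipeline.
The paper points out ongoing challenges in data quality, privacy, and alignment, and outlines opportunities to bridge academia and industrial practice.
A dedicated resource site, codellm.github.io, has been established to document and disseminate progress.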
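
The Pass@1 scores cited for HumanEval are the k = 1 case of the pass@k metric. For readers unfamiliar with it, the following is a minimal sketch of the unbiased estimator commonly used with HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c the number that pass the unit tests. The function names and the sample counts in the usage lines are hypothetical illustrations, not code or results from the survey.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k), the probability that at least
    one of k samples drawn without replacement from the n is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over per-problem (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for three problems, 10 samples each.
    demo = [(10, 3), (10, 10), (10, 0)]
    print(f"pass@1 = {benchmark_pass_at_k(demo, 1):.3f}")
    print(f"pass@5 = {benchmark_pass_at_k(demo, 5):.3f}")
```

With k = 1 the estimator reduces to c / n per problem, so a benchmark-level Pass@1 is simply the mean fraction of passing samples.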

This review was written by AI and checked by a human editor.