QUICK REVIEW

[Paper Review] RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Tianyang Liu, Canwen Xu | arXiv (Cornell University) | Jun 5, 2023

Software Engineering Research | Cited by 11

One-Line Summary

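RepoBench introduces a repository-level benchmark for code auto-completion built around three tasks (retrieval, code completion, and pipeline). It evaluates multiple models and retrieval strategies on Python and Java.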

ABSTRACT

Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encouraging continuous improvement in auto-completion systems. RepoBench is publicly available at https://github.com/Leolty/repobench.

Motivation and Objectives

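Close the evaluation gap for multi-file, repository-scale code completion systems, which single-file benchmarks do not cover.
Provide a benchmark with three interconnected tasks (retrieval, code completion, and pipeline) so that the end-to-end workflow can be evaluated.
Offer insight into how models handle cross-file and long contexts in real-world repositories.
Enable fair comparison of repository-level code intelligence and drive further improvement.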

Proposed Method

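Build a dataset focused on Python and Java from GitHub code (training) and newly crawled GitHub repositories (test).
Parse cross-file dependencies with tree-sitter to identify cross-file lines and their corresponding definitions.
Define three tasks: RepoBench-R (retrieval), RepoBench-C (code completion), and RepoBench-P (pipeline).
Construct prompts by combining cross-file context (from imports) with in-file context (the preceding lines), with a 30-line cap in RepoBench-C.
Evaluate retrieval with acc@k, and completion and pipeline with Exact Match (EM) and Edit Similarity (ES).
Test a broad set of baselines on the Python and Java variants, including lexical retrieval, semantic retrieval (CodeBERT, UniXcoder), and large language models (Codex, CodeGen, StarCoder).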
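
The three metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the function names are hypothetical, and `difflib`'s similarity ratio is used here as a stand-in for Edit Similarity, which may differ from the exact formula RepoBench uses.

```python
from difflib import SequenceMatcher


def exact_match(prediction: str, reference: str) -> bool:
    """EM: the predicted line equals the reference after trimming whitespace."""
    return prediction.strip() == reference.strip()


def edit_similarity(prediction: str, reference: str) -> float:
    """ES stand-in (assumption): difflib similarity ratio in [0, 1]."""
    return SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()


def acc_at_k(ranked_ids: list[int], gold_id: int, k: int) -> bool:
    """acc@k: the gold snippet is among the top-k retrieved candidates."""
    return gold_id in ranked_ids[:k]
```

EM is all-or-nothing per line, while ES gives partial credit for near misses, which is why the two are usually reported together.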

Experimental Results

Research Questions

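RQ1: How well do retrieval strategies identify the relevant cross-file snippets that support next-line prediction?
RQ2: How effectively do models predict the next line of code from cross-file and in-file context under the different settings (XF-F: Cross-File-First, XF-R: Cross-File-Random, IF: In-File)?
RQ3: What is the end-to-end performance of a pipeline that combines retrieval and completion, and how does the placement of cross-file context affect the results?
RQ4: Does the repository-level benchmark reveal language-specific differences, such as Python versus Java, in retrieval and completion performance?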

Key Findings

Retrieval method	XF-F EM	XF-F ES	XF-R EM	XF-R ES	IF EM	IF ES	Overall EM	Overall ES
Gold-Only * (Python)	30.59	70.43	40.65	74.45	41.10	78.32	35.79	73.59
Gold-Filled-Head * (Python)	31.07	70.48	39.77	74.42	41.87	78.56	36.07	73.68
Gold-Filled-Tail * (Python)	31.35	70.73	40.56	74.37	41.21	78.50	36.18	73.77
UniXcoder-H2L (Python)	30.99	70.68	40.71	74.74	43.19	79.18	36.61	74.02
UniXcoder-L2H (Python)	32.12	71.36	40.59	74.48	43.07	79.22	37.11	74.32
Random (Python)	28.06	68.95	38.75	73.81	41.16	78.34	34.15	72.72
Baseline (Python)	26.75	68.16	37.60	73.30	40.79	78.26	33.15	72.20
Gold-Only * (Java)	32.48	70.39	43.51	76.49	55.91	81.65	41.62	74.95
Gold-Filled-Head * (Java)	32.48	70.29	43.44	76.36	55.84	81.68	41.59	74.88
Gold-Filled-Tail * (Java)	32.37	70.30	43.48	76.63	55.66	81.55	41.49	74.91
UniXcoder-H2L (Java)	32.46	70.16	42.79	76.14	56.71	81.86	41.70	74.83
UniXcoder-L2H (Java)	32.69	70.33	43.08	76.21	56.57	81.89	41.83	74.93
Random (Java)	31.34	69.81	42.08	75.73	55.94	81.67	40.77	74.51
Baseline (Java)	30.73	69.48	41.44	75.42	56.18	81.70	40.40	74.29
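
To make the size of these differences concrete, the short sketch below computes the gain in overall EM over the no-retrieval Baseline row for a few methods. The numbers are copied from the table above; the dictionary layout and the helper name `gain_over_baseline` are illustrative assumptions.

```python
# Overall EM values copied from the table above.
overall_em = {
    "Python": {"Baseline": 33.15, "Random": 34.15,
               "Gold-Only": 35.79, "UniXcoder-L2H": 37.11},
    "Java": {"Baseline": 40.40, "Random": 40.77,
             "Gold-Only": 41.62, "UniXcoder-L2H": 41.83},
}


def gain_over_baseline(language: str, method: str) -> float:
    """Overall EM of a method minus the Baseline row, in points."""
    scores = overall_em[language]
    return round(scores[method] - scores["Baseline"], 2)


for language, scores in overall_em.items():
    for method in scores:
        if method != "Baseline":
            print(f"{language:6} {method:14} {gain_over_baseline(language, method):+.2f}")
```

For UniXcoder-L2H the gain is 3.96 points on Python and 1.43 points on Java, so in this table cross-file context helps Python noticeably more than Java.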

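UniXcoder consistently outperforms the other retrieval methods on RepoBench-R, indicating a stronger semantic understanding of cross-file snippets.
Among lexical retrieval methods, Jaccard similarity generally outperforms Edit Similarity at ranking cross-file snippets by relevance.
Retrieval accuracy on Python tends to be higher than on Java across methods.
On RepoBench-P, incorporating cross-file context improves performance across all settings, and an effective retriever (e.g., UniXcoder) improves both the retrieval and the completion results.
The order and placement of retrieved snippets affect completion quality: placing the most relevant snippets closer to the target line is more effective.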
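
The lexical retrieval and snippet-ordering findings can be illustrated with a small sketch. It ranks candidate snippets by token-level Jaccard similarity to the in-file context, then builds a prompt in low-to-high order (the L2H variant in the table), so that the best match sits right before the code being completed. The tokenizer, the function names, and the prompt layout are all assumptions made for illustration; they are not taken from the RepoBench code.

```python
import re


def tokenize(code: str) -> set[str]:
    """Split code into a set of identifier-like and numeric tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two code strings."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rank_snippets(query: str, candidates: list[str]) -> list[str]:
    """Return candidates sorted from most to least similar to the query."""
    return sorted(candidates, key=lambda c: jaccard(query, c), reverse=True)


def build_prompt(in_file_context: str, ranked: list[str],
                 low_to_high: bool = True) -> str:
    """Place cross-file snippets before the in-file context.

    With low_to_high=True the best-ranked snippet comes last, i.e. closest
    to the line being completed; with False the order is high-to-low.
    """
    ordered = list(reversed(ranked)) if low_to_high else list(ranked)
    return "\n\n".join(ordered + [in_file_context])


candidates = [
    "def load_config(path):\n    return parse(path)",
    "def save_report(report, path):\n    write(report, path)",
    "class Timer:\n    pass",
]
query = "config = load_config(path)"
ranked = rank_snippets(query, candidates)
prompt = build_prompt(query, ranked)
```

Switching `low_to_high` to `False` gives the H2L layout, which makes it easy to compare the two placements with the same retriever.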


This review was generated by AI and checked by a human editor.