QUICK REVIEW

[Paper Review] CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

Yangruibo Ding, Zijian Wang|arXiv (Cornell University)|Oct 17, 2023

Software Engineering Research | Cited by 10

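One-Sentence Summary

CrossCodeEval is a multilingual code completion benchmark spanning Python, Java, TypeScript, and C# whose examples require cross-file context. It uses static analysis to select examples that depend on cross-file context, evaluates CodeGen, StarCoder, and GPT-3.5-Turbo with in-file, retrieval, and retrieval-with-reference prompts, and shows that cross-file context is both important and difficult for code language models.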

ABSTRACT

Code completion models have made significant progress in recent years, yet current popular evaluation datasets, such as HumanEval and MBPP, predominantly focus on code completion tasks within a single file. This over-simplified setting falls short of representing the real-world software development scenario where repositories span multiple files with numerous cross-file dependencies, and accessing and understanding cross-file context is often required to complete the code correctly. To fill in this gap, we propose CrossCodeEval, a diverse and multilingual code completion benchmark that necessitates an in-depth cross-file contextual understanding to complete the code accurately. CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively-licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. To create examples that strictly require cross-file context for accurate completion, we propose a straightforward yet efficient static-analysis-based approach to pinpoint the use of cross-file context within the current file. Extensive experiments on state-of-the-art code language models like CodeGen and StarCoder demonstrate that CrossCodeEval is extremely challenging when the relevant cross-file context is absent, and we see clear improvements when adding these context into the prompt. However, despite such improvements, the pinnacle of performance remains notably unattained even with the highest-performing model, indicating that CrossCodeEval is also capable of assessing model's capability in leveraging extensive context to make better code completion. Finally, we benchmarked various methods in retrieving cross-file context, and show that CrossCodeEval can also be used to measure the capability of code retrievers.

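Motivation and Objectives

Motivate the need to evaluate code completion with cross-file context in realistic software repositories.
Build CrossCodeEval from real-world open-source repositories so that every example strictly requires cross-file context to complete.
Keep overlap with model training data to a minimum so that memorization does not confound the results.
Provide a benchmark that can also evaluate how well retrievers find cross-file context.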

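Proposed Method

Construct a cross-file code completion dataset from permissively licensed GitHub repositories in four languages.
Use static analysis to automatically identify cross-file context dependencies and generate prompts that require cross-file context.
Apply post-processing and filtering for quality and to reduce data leakage, including a retrieval-based verification step.
Evaluate several code LMs, including CodeGen (multiple sizes), StarCoder, and GPT-3.5-Turbo, in a zero-shot setting with in-file, retrieved, and retrieved-with-reference prompts.
Measure performance with two groups of metrics: code match (exact match and edit similarity) and identifier match (EM and F1).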
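As an illustration of the static-analysis step, the sketch below finds the lines in a Python file that use names imported from other files in the same project. It is a simplified stand-in, not the authors' pipeline: the function name `find_cross_file_usages`, the `local_modules` argument, and the example source are all hypothetical, and it assumes the set of project-local top-level modules is already known.

```python
import ast


def find_cross_file_usages(source: str, local_modules: set[str]) -> list[tuple[int, str]]:
    """Return sorted (line number, name) pairs where a name imported from a
    project-local module is read in `source`. Hypothetical helper."""
    tree = ast.parse(source)
    imported: set[str] = set()

    # Pass 1: collect the names that imports of local modules bind in this file.
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            top = node.module.split(".")[0] if node.module else None
            # Relative imports (level > 0) are treated as project-local.
            if node.level > 0 or top in local_modules:
                imported.update(alias.asname or alias.name for alias in node.names)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                top = alias.name.split(".")[0]
                if top in local_modules:
                    # `import pkg.mod` binds `pkg`; `import pkg.mod as m` binds `m`.
                    imported.add(alias.asname or top)

    # Pass 2: record every place one of those names is read.
    usages = [
        (node.lineno, node.id)
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Load)
        and node.id in imported
    ]
    return sorted(usages)


example = """\
import os
from utils.config import load_config
from models import Encoder as Enc

def main():
    cfg = load_config(os.getcwd())
    return Enc(cfg)
"""

usages = find_cross_file_usages(example, {"utils", "models"})
print(usages)
```

In this example, lines 6 and 7 would be candidate completion targets because they use `load_config` and `Enc`, which are defined in other files; `os` is ignored because it is not project-local.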

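Experimental Results

Research Questions

RQ1: Does cross-file context improve code completion performance across major programming languages?
RQ2: How do retrieval-based methods compare with in-file context alone for cross-file code completion?
RQ3: What is the upper bound on performance when cross-file context is retrieved using the reference completion?
RQ4: How do different retrievers (BM25, neural retrievers) perform on the cross-file code retrieval task in this benchmark?
RQ5: Can CrossCodeEval serve as a benchmark for the quality and strategy of code retrieval?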

Key Findings

Model	Python EM	Python ES	Java EM	Java ES	TypeScript EM	TypeScript ES	C# EM	C# ES
CodeGen25-7B (In-file)	7.73	59.34	10.43	62.05	7.81	57.56	4.36	58.99
CodeGen25-7B Retrieval	14.52	64.40	16.88	64.35	12.57	60.08	13.01	63.86
CodeGen25-7B Retrieval w/ Ref.	19.17	67.46	20.20	66.17	15.35	62.73	17.87	66.14
StarCoder-15.5B (In-file)	8.82	61.08	9.96	63.25	6.35	51.22	4.47	59.80
StarCoder-15.5B Retrieval	15.72	66.28	17.48	66.10	8.31	44.87	13.57	65.00
StarCoder-15.5B Retrieval w/ Ref.	21.01	68.66	19.92	67.75	11.02	46.67	20.08	67.97
GPT-3.5-turbo (In-file)	4.88	52.58	12.30	63.52	6.38	53.78	3.56	56.48
GPT-3.5-turbo Retrieval	10.77	54.92	19.12	65.61	10.94	55.83	11.82	62.40
GPT-3.5-turbo Retrieval w/ Ref.	15.72	58.88	22.72	68.50	14.15	58.40	17.65	66.07
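To make the EM and ES columns concrete, here is a minimal sketch of exact match and a Levenshtein-based edit similarity on a 0 to 100 scale. This is one common definition, written under the assumption that both strings are whitespace-stripped first; the paper's own evaluation scripts may normalize or tokenize differently, and the function names are illustrative.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """True when the completion equals the reference after stripping whitespace."""
    return prediction.strip() == reference.strip()


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """100 * (1 - distance / longer length); 100.0 means identical strings."""
    p, r = prediction.strip(), reference.strip()
    longest = max(len(p), len(r))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(p, r) / longest)


print(exact_match("foo.bar()", "foo.baz()"))
print(round(edit_similarity("foo.bar()", "foo.baz()"), 2))
```

A completion that is wrong by a single character fails exact match but still scores a high edit similarity, which is consistent with the table showing ES values far above EM values.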

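With in-file context alone, every model performs poorly, which shows that cross-file context is necessary.
Adding cross-file context yields marked improvements across languages and models.
Retrieved cross-file context improves exact match by up to roughly 3x for some model and language pairs (for example, CodeGen25-7B on C# rises from 4.36 to 13.01), and retrieval with reference improves it further, giving an estimate of the upper bound.
Even with retrieved cross-file context, performance remains far from perfect, leaving room for methods that make better use of repository-wide context.
Retriever quality strongly affects results: OpenAI ada embeddings often outperform the sparse retriever BM25 in certain languages, yet even the best retriever falls short of the ideal upper bound, which points to the need for better cross-file retrieval techniques.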
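The difference between the retrieval and retrieval-with-reference settings can be sketched with a small pure-Python BM25 ranker. This is a toy under stated assumptions, not the benchmark's retriever: the tokenizer, the `k1` and `b` values, the three example chunks, and the idea that the with-reference setting simply appends the ground-truth completion to the query are all illustrative choices.

```python
import math
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    """Lowercased identifier-like tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z_][a-z0-9_]*", code.lower())


def bm25_scores(query: str, chunks: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each chunk against the query."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    if n == 0:
        return []
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    df = Counter(term for d in docs for term in set(d))  # document frequency
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks ranked by BM25 score."""
    scores = bm25_scores(query, chunks)
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:top_k]]


# Hypothetical cross-file chunks from other files in the repository.
chunks = [
    "def load_config(path):\n    return parse_yaml(path)",
    "class Encoder:\n    def encode(self, text):\n        return self.model(text)",
    "def save_checkpoint(model, path):\n    torch.save(model, path)",
]
in_file_context = "model = build()\nsettings = "
reference = "load_config(path)"  # ground-truth completion, unknown at inference time

print(retrieve(in_file_context, chunks)[0])              # query from in-file context only
print(retrieve(in_file_context + reference, chunks)[0])  # query that also sees the reference
```

With only the in-file context as the query, the ranker is drawn to the chunk that mentions `model`; once the reference is included, it surfaces the chunk defining `load_config`. Since the reference is unavailable in real use, the with-reference setting is best read as an estimate of what a better retriever could achieve.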

This review was generated by AI and checked by a human editor.