QUICK REVIEW

[Paper Review] RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Tianyang Liu, Canwen Xu|arXiv (Cornell University)|2023. 06. 05.

Software Engineering Research | Citations: 11

One-Line Summary

RepoBench introduces a repository-level code auto-completion benchmark that evaluates multiple models and retrieval strategies in Python and Java on three tasks: Retrieval, Code Completion, and Pipeline.

ABSTRACT

Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encouraging continuous improvement in auto-completion systems. RepoBench is publicly available at https://github.com/Leolty/repobench.

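Research Motivation and Goals

Close the gap in evaluating multi-file, repository-scale code completion systems, which single-file benchmarks do not cover.
Provide a benchmark with three interconnected tasks (Retrieval, Code Completion, Pipeline) for evaluating end-to-end workflows.
Offer insight into how models handle cross-file and long-context code in real repositories.
Promote fair comparison and drive improvement in repository-level code intelligence.

Method

Build the dataset from GitHub code (training), focusing on Python and Java, plus newly collected GitHub repositories (test).
Parse cross-file dependencies with tree-sitter to identify cross-file lines and their corresponding definitions.
Define three tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline).
For RepoBench-C, build prompts by combining cross-file context (from imports) with in-file context (preceding lines), under a 30-line limit.
Evaluate retrieval with acc@k, and evaluate completion and the pipeline with exact match (EM) and edit similarity (ES).
Test a range of baselines, including lexical and semantic retrieval (CodeBERT, UniXcoder) and large language models (Codex, CodeGen, StarCoder), in Python and Java variants.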
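The dependency-parsing step in the method list can be illustrated roughly as follows. The paper uses tree-sitter; this sketch swaps in Python's standard-library ast module instead and handles Python source only. The function names imported_names and cross_file_lines, and the repo_modules argument, are hypothetical and not taken from the RepoBench code.

```python
import ast


def imported_names(source: str) -> dict[str, str]:
    """Map each locally bound imported name to the module it comes from."""
    names = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the name "a" unless an alias is given.
                local = alias.asname or alias.name.split(".")[0]
                names[local] = alias.name
        elif isinstance(node, ast.ImportFrom):
            # Keep leading dots so relative imports stay distinguishable.
            module = "." * node.level + (node.module or "")
            for alias in node.names:
                names[alias.asname or alias.name] = module
    return names


def cross_file_lines(source: str, repo_modules: set[str]) -> list[int]:
    """Return 1-based line numbers that use a name imported from a repo module.

    repo_modules is an assumed, caller-supplied set of module names that
    live inside the same repository.
    """
    local = {n: m for n, m in imported_names(source).items() if m in repo_modules}
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id in local:
            hits.add(node.lineno)
    return sorted(hits)


example = (
    "import os\n"
    "from utils import helper\n"
    "\n"
    "x = os.getcwd()\n"
    "y = helper(x)\n"
)
print(cross_file_lines(example, {"utils"}))  # prints [5]
```

Only line 5 is reported because helper comes from the in-repository module utils, while os is a standard-library import. Locating the definition that each such line refers to would need a second pass over the other files, which is omitted here.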
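The three metrics named in the method list (acc@k, EM, ES) can be sketched as below. This is a minimal illustration under stated assumptions, not the benchmark's scoring code: edit similarity is taken here to be 1 minus the Levenshtein distance normalized by the longer string, and both strings are whitespace-stripped before comparison. The paper's exact normalization may differ, and all function names are hypothetical. The results table reports scores on a 0 to 100 scale, so the values below would be multiplied by 100.

```python
def exact_match(pred: str, target: str) -> bool:
    """EM: the predicted line equals the target line (after stripping, an assumption)."""
    return pred.strip() == target.strip()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(pred: str, target: str) -> float:
    """ES in [0, 1]: 1 - distance / length of the longer string (assumed formula)."""
    pred, target = pred.strip(), target.strip()
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, target) / longest


def acc_at_k(rankings: list[list[int]], gold: list[int], k: int) -> float:
    """acc@k: share of examples whose gold snippet index is among the top-k ranked."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)
```

For instance, with rankings [[2, 0, 1], [1, 2, 0]] and gold snippet index 0 in both examples, acc@1 is 0.0, acc@2 is 0.5, and acc@3 is 1.0.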

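Experimental Results

Research Questions

RQ1 How well do retrieval strategies identify the cross-file snippets that support next-line prediction?
RQ2 How effectively can models predict the next line of code from cross-file and in-file context across different settings (XF-F, XF-R, IF)?
RQ3 What is the end-to-end performance when retrieval and completion are combined into a pipeline, and how does the placement of cross-file context affect the results?
RQ4 Does a repository-level benchmark reveal language-specific differences between Python and Java in retrieval and completion performance?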

Key Results

Retrieval method	XF-F EM	XF-F ES	XF-R EM	XF-R ES	IF EM	IF ES	All EM	All ES
Gold-Only * (Python)	30.59	70.43	40.65	74.45	41.10	78.32	35.79	73.59
Gold-Filled-Head * (Python)	31.07	70.48	39.77	74.42	41.87	78.56	36.07	73.68
Gold-Filled-Tail * (Python)	31.35	70.73	40.56	74.37	41.21	78.50	36.18	73.77
UniXcoder-H2L (Python)	30.99	70.68	40.71	74.74	43.19	79.18	36.61	74.02
UniXcoder-L2H (Python)	32.12	71.36	40.59	74.48	43.07	79.22	37.11	74.32
Random (Python)	28.06	68.95	38.75	73.81	41.16	78.34	34.15	72.72
Baseline (Python)	26.75	68.16	37.60	73.30	40.79	78.26	33.15	72.20
Gold-Only * (Java)	32.48	70.39	43.51	76.49	55.91	81.65	41.62	74.95
Gold-Filled-Head * (Java)	32.48	70.29	43.44	76.36	55.84	81.68	41.59	74.88
Gold-Filled-Tail * (Java)	32.37	70.30	43.48	76.63	55.66	81.55	41.49	74.91
UniXcoder-H2L (Java)	32.46	70.16	42.79	76.14	56.71	81.86	41.70	74.83
UniXcoder-L2H (Java)	32.69	70.33	43.08	76.21	56.57	81.89	41.83	74.93
Random (Java)	31.34	69.81	42.08	75.73	55.94	81.67	40.77	74.51
Baseline (Java)	30.73	69.48	41.44	75.42	56.18	81.70	40.40	74.29

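UniXcoder consistently outperforms the other retrieval methods on RepoBench-R, indicating strong semantic understanding of cross-file snippets.
Among lexical retrieval methods, Jaccard similarity generally outperforms edit similarity at identifying relevant cross-file snippets.
Python retrieval tasks tend to show higher accuracy than Java across methods.
On RepoBench-P, including cross-file context improves overall EM and ES over the Baseline row for every retrieval strategy in the table (for example, Python overall EM rises from 33.15 to 37.11 with UniXcoder-L2H), and effective retrieval (such as UniXcoder) improves both retrieval and completion results.
The order and placement of retrieved snippets affect completion quality; placing the most relevant snippets closer to the target line is beneficial.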
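The lexical retrieval and snippet-ordering findings above can be sketched as a small ranking and prompt-building routine. This is an illustration under assumptions, not the paper's implementation: the tokenizer is a simple regular expression, the in-file code before the target line is used as the query, and the low-to-high ordering (taken here to be what L2H denotes in the table) puts the most similar snippet last so that it sits next to the in-file code. All function names and the parameter k are hypothetical.

```python
import re


def tokens(code: str) -> set[str]:
    """Split code into a set of identifiers, numbers, and single symbols."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|\S", code))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two code strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rank_snippets(query: str, candidates: list[str]) -> list[int]:
    """Return candidate indices sorted from most to least similar to the query."""
    return sorted(range(len(candidates)),
                  key=lambda i: jaccard(query, candidates[i]),
                  reverse=True)


def build_prompt(query: str, candidates: list[str], k: int = 2) -> str:
    """Keep the top-k snippets and order them low-to-high before the in-file code."""
    top = rank_snippets(query, candidates)[:k]
    ordered = [candidates[i] for i in reversed(top)]  # best snippet ends up last
    return "\n\n".join(ordered + [query])


in_file = "total = add(a, b)"
snippets = [
    "def add(a, b):\n    return a + b",
    "def greet(name):\n    print(name)",
    "def sub(a, b):\n    return a - b",
]
print(rank_snippets(in_file, snippets))  # prints [0, 2, 1]
```

With k = 2, the prompt contains the sub snippet first, then the add snippet, then the in-file code, so the best-matching definition is the one closest to the line being completed. Swapping jaccard for an embedding-based similarity would give the semantic variant of the same pipeline.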

This review was generated by AI and reviewed by a human editor.