QUICK REVIEW

[Paper Review] CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

Yangruibo Ding, Zijian Wang|arXiv (Cornell University)|2023. 10. 17.

Software Engineering Research | Citations: 10

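One-Line Summary

CrossCodeEval introduces a multilingual benchmark for cross-file code completion covering Python, Java, TypeScript, and C#, and uses static analysis to select examples that require cross-file context. It evaluates CodeGen, StarCoder, and GPT-3.5-Turbo with in-file, retrieval, and retrieval-with-reference prompts, highlighting both the importance and the difficulty of cross-file context for code language models.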

ABSTRACT

Code completion models have made significant progress in recent years, yet current popular evaluation datasets, such as HumanEval and MBPP, predominantly focus on code completion tasks within a single file. This over-simplified setting falls short of representing the real-world software development scenario where repositories span multiple files with numerous cross-file dependencies, and accessing and understanding cross-file context is often required to complete the code correctly. To fill in this gap, we propose CrossCodeEval, a diverse and multilingual code completion benchmark that necessitates an in-depth cross-file contextual understanding to complete the code accurately. CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively-licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. To create examples that strictly require cross-file context for accurate completion, we propose a straightforward yet efficient static-analysis-based approach to pinpoint the use of cross-file context within the current file. Extensive experiments on state-of-the-art code language models like CodeGen and StarCoder demonstrate that CrossCodeEval is extremely challenging when the relevant cross-file context is absent, and we see clear improvements when adding these context into the prompt. However, despite such improvements, the pinnacle of performance remains notably unattained even with the highest-performing model, indicating that CrossCodeEval is also capable of assessing model's capability in leveraging extensive context to make better code completion. Finally, we benchmarked various methods in retrieving cross-file context, and show that CrossCodeEval can also be used to measure the capability of code retrievers.

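Motivation and Goals

Motivate the need to evaluate code completion with cross-file context in real software repositories.
Build CrossCodeEval from real open-source repositories, using examples that strictly require cross-file context to complete.
Keep overlap with model training data minimal, so that memorization does not confound the results.
Provide a benchmark that can also measure how well retrievers find cross-file context.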

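Proposed Method

Build a cross-file code completion dataset from permissively licensed GitHub repositories in four languages.
Use static analysis to automatically identify cross-file dependencies, and build prompts at those locations so that completion requires cross-file context.
Apply post-processing, filtering, and a retrieval-based verification step to improve data quality and reduce leakage.
Evaluate several code LMs (the largest CodeGen sizes, StarCoder, GPT-3.5-Turbo) in a zero-shot setting with in-file, retrieval, and retrieval-with-reference prompts.
Measure performance with two families of metrics: code match (exact match and edit similarity) and identifier match (EM and F1).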
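
The static-analysis step above can be illustrated with a small sketch. This review does not spell out the paper's actual pipeline, so the code below is only a simplified Python-only approximation: it assumes that "cross-file usage" means a load of a name imported from a project-local module. The function name `cross_file_usages` and the `local_modules` parameter are hypothetical and introduced here for illustration.

```python
import ast


def cross_file_usages(source: str, local_modules: set[str]) -> list[tuple[int, str]]:
    """Return (line, name) pairs where a name imported from a local module is used."""
    tree = ast.parse(source)

    # Collect names bound by imports that point at project-local modules.
    # Assumption: local modules are given by their top-level import name.
    imported: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module in local_modules:
            imported.update(alias.asname or alias.name for alias in node.names)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name in local_modules:
                    imported.add(alias.asname or alias.name)

    # Every load of such a name is a candidate position for a completion
    # example that cannot be solved from the current file alone.
    usages = [
        (node.lineno, node.id)
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Load)
        and node.id in imported
    ]
    return sorted(usages)
```

For a file that imports `load_config` from a local `utils` module and calls it on line 4, the function reports that line as a candidate completion point, while uses of standard-library names such as `os` are ignored.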

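Experimental Results

Research Questions

RQ1: Does cross-file context improve code completion performance across major programming languages?
RQ2: How do retrieval-based methods compare with in-file context alone for cross-file code completion?
RQ3: What is the upper bound on performance when retrieval is guided by the reference completion?
RQ4: How do different retrievers (BM25, neural retrievers) perform at cross-file code retrieval on this benchmark?
RQ5: Can CrossCodeEval serve as a benchmark for code retrieval quality and strategies?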

Key Results

Model	Python EM	Python ES	Java EM	Java ES	TypeScript EM	TypeScript ES	C# EM	C# ES
CodeGen25-7B (In-file)	7.73	59.34	10.43	62.05	7.81	57.56	4.36	58.99
CodeGen25-7B Retrieval	14.52	64.40	16.88	64.35	12.57	60.08	13.01	63.86
CodeGen25-7B Retrieval w/ Ref.	19.17	67.46	20.20	66.17	15.35	62.73	17.87	66.14
StarCoder-15.5B (In-file)	8.82	61.08	9.96	63.25	6.35	51.22	4.47	59.80
StarCoder-15.5B Retrieval	15.72	66.28	17.48	66.10	8.31	44.87	13.57	65.00
StarCoder-15.5B Retrieval w/ Ref.	21.01	68.66	19.92	67.75	11.02	46.67	20.08	67.97
GPT-3.5-turbo (In-file)	4.88	52.58	12.30	63.52	6.38	53.78	3.56	56.48
GPT-3.5-turbo Retrieval	10.77	54.92	19.12	65.61	10.94	55.83	11.82	62.40
GPT-3.5-turbo Retrieval w/ Ref.	15.72	58.88	22.72	68.50	14.15	58.40	17.65	66.07
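
As a rough guide to reading the EM and ES columns, the sketch below computes exact match and a Levenshtein-based edit similarity on a 0 to 100 scale. The exact normalization used by the paper is not stated in this review, so the formula here (one minus edit distance divided by the longer string's length) is an assumption, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def exact_match(pred: str, ref: str) -> bool:
    """True when prediction and reference agree after trimming whitespace."""
    return pred.strip() == ref.strip()


def edit_similarity(pred: str, ref: str) -> float:
    """Similarity in [0, 100]; assumed normalization by the longer string."""
    pred, ref = pred.strip(), ref.strip()
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(pred, ref) / longest)
```

Under this definition a prediction can miss exact match while still scoring high on edit similarity, which is consistent with the table showing ES values far above EM values.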

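With in-file context alone, every model performs poorly, which shows the need for cross-file context.
Adding cross-file context yields substantial gains across languages and models.
Retrieving cross-file context improves code exact match by up to roughly 3x for some model and language pairs (for example, CodeGen25-7B on C# rises from 4.36 to 13.01), and retrieval with reference gives an upper-bound estimate for the retrieval setting.
Even with retrieved cross-file context, performance remains far from perfect, leaving room for better use of broad repository-level context.
Retrieval quality strongly affects the results: OpenAI ada embeddings outperform sparse BM25 in several languages but still fall short of the upper bound, which points to a need for better cross-file retrieval techniques.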
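
To make the sparse-retrieval baseline concrete, here is a minimal, dependency-free BM25 ranker over code chunks. It is a sketch under stated assumptions, not the paper's retrieval setup: the identifier-based tokenizer, the chunking of other files into strings, and the `k1` and `b` defaults (conventional BM25 values) are all choices made here for illustration.

```python
import math
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    """Split code into lowercase identifier-like tokens (assumed tokenizer)."""
    return re.findall(r"[A-Za-z_]\w*", code.lower())


def bm25_rank(query: str, chunks: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return chunk indices sorted from most to least relevant to the query."""
    docs = [Counter(tokenize(chunk)) for chunk in chunks]
    lengths = [sum(doc.values()) for doc in docs]
    n = len(docs)
    avg_len = sum(lengths) / n if n else 0.0
    query_terms = set(tokenize(query))

    scores = []
    for doc, length in zip(docs, lengths):
        score = 0.0
        for term in query_terms:
            tf = doc[term]
            if tf == 0:
                continue
            # Document frequency and a smoothed, always-positive IDF.
            df = sum(1 for d in docs if term in d)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        scores.append(score)

    # sorted() is stable, so tied chunks keep their original order.
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

In a retrieval prompt, the query would typically be the last few lines of the in-file context, and the top-ranked chunks from other files would be prepended to the prompt before the model completes the code.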


This review was generated by AI and reviewed by a human editor.