QUICK REVIEW

[Paper Review] Practical Program Repair in the Era of Large Pre-trained Language Models

Chunqiu Steven Xia, Yuxiang Wei|arXiv (Cornell University)|2022. 10. 25.

Software Engineering Research · Citations: 29

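One-line summary

This paper presents the first extensive evaluation of state-of-the-art large PLMs for automated program repair across multiple datasets and languages, showing that PLMs can outperform existing APR tools, that larger models generally perform better, and that the suffix context used in infilling improves patch quality.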

ABSTRACT

Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face the problem of limited patch variety, failing to fix complicated bugs. This is mainly due to the reliance on bug-fixing datasets to craft fix templates or directly predict potential patches. Large Pre-Trained Language Models (PLMs), trained using billions of text/code tokens, can potentially help avoid this issue. Very recently, researchers have directly leveraged PLMs for APR without relying on any bug-fixing datasets. Meanwhile, such existing work either failed to include state-of-the-art PLMs or was not evaluated on realistic datasets. In this work, we perform the first extensive study on directly applying PLMs for APR. We select 9 recent state-of-the-art PLMs, including both generative and infilling models, ranging from 125M to 20B in size. We designed 3 different repair settings to evaluate the different ways we can use PLMs to generate patches. We apply the PLMs under these repair settings on 5 datasets across 3 different languages and compare different PLMs in the number of bugs fixed, generation speed and compilation rate. Our study demonstrates that directly applying state-of-the-art PLMs can already substantially outperform all existing APR techniques on all our datasets. Among the studied PLMs, the scaling effect exists for APR where larger models tend to achieve better performance. Also, we show for the first time that suffix code after the buggy line (adopted in infilling-style APR) is important in not only generating more fixes but more patches with higher compilation rate. Besides patch generation, the PLMs consider correct patches to be more natural than other ones, and can even be leveraged for effective patch ranking or patch correctness checking.

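Motivation and objectives

Evaluate how a range of large PLMs perform on automated program repair across multiple datasets and languages.
Compare PLM-based APR against existing state-of-the-art traditional and learning-based APR tools.
Investigate how the repair setting (complete function generation, correct code infilling, single line generation) affects patch quality and speed.
Explore patch ranking and patch correctness checking using entropy, a metric derived from the PLM itself.
Identify practical guidelines (sample size, fix templates) for improving PLM-based APR performance.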

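Proposed method

Evaluate 9 large PLMs (125M–20B parameters), covering both generative and infilling models, on 5 real-world repair datasets (Java, Python, C).
Three repair settings: complete function generation, correct code infilling, and single line generation.
Use prompts and few-shot examples so that the PLMs can generate patches without any bug-fixing training data.
Generate multiple patches per bug using nucleus sampling (top-p, temperature), and rank the patches by entropy.
Validate patches by running the test suite, distinguishing plausible patches from correct ones.
Compare PLM-based APR against 20 baseline APR tools (learning-based and traditional).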
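The entropy-based ranking step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "entropy" means the negative log-probability of the generated patch tokens, either averaged or summed, and that lower values mark more natural patches. The function names and the candidate patches with their probabilities are hypothetical.

```python
import math
from typing import List, Tuple


def mean_entropy(token_logprobs: List[float]) -> float:
    """Average negative log-probability per token (lower = more natural)."""
    if not token_logprobs:
        raise ValueError("patch has no tokens")
    return -sum(token_logprobs) / len(token_logprobs)


def sum_entropy(token_logprobs: List[float]) -> float:
    """Total negative log-probability of the patch (penalizes longer patches)."""
    return -sum(token_logprobs)


def rank_patches(candidates: List[Tuple[str, List[float]]]) -> List[str]:
    """Return patch texts sorted from lowest to highest mean entropy."""
    ordered = sorted(candidates, key=lambda c: mean_entropy(c[1]))
    return [patch for patch, _ in ordered]


# Hypothetical candidates: (patch text, per-token log-probabilities from the model).
candidates = [
    ("if (i <= n) {", [math.log(0.2), math.log(0.1), math.log(0.3)]),
    ("if (i < n) {", [math.log(0.9), math.log(0.8), math.log(0.7)]),
    ("while (true) {", [math.log(0.05), math.log(0.5)]),
]
ranked = rank_patches(candidates)
```

In a real pipeline the per-token log-probabilities would come from the model that generated (or is scoring) each patch, and the ranked list would decide which patches to validate against the test suite first.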

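Experimental results

Research questions

RQ1: How do PLMs of different types and sizes perform in each APR setting across datasets and languages?
RQ2: Do PLMs outperform state-of-the-art APR tools on real-world bugs?
RQ3: Can PLMs be used effectively for patch ranking and patch correctness checking via entropy?
RQ4: Do strategies such as increasing the number of samples or introducing fix templates further improve PLM-based APR performance?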

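Key results

Larger PLMs generally produce more correct and plausible patches (scaling effect).
Codex, with its code-focused pre-training and tuning, often outperforms the other models across several settings.
Infilling with suffix context (prefix + suffix) improves both the number of fixes and the patch compilation rate.
When suffix context is available, infilling models outperform generative models on the single line generation and infilling tasks.
Correct code infilling and single line generation yield a higher proportion of correct patches than complete function generation.
Patch generation slows down as models grow, but Codex shows strong repair ability on some datasets despite slower inference.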
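To make the role of suffix context concrete, the sketch below builds one illustrative model input per repair setting for a function with a single buggy line. It is a simplified illustration under stated assumptions, not the paper's actual prompts: the `<INFILL>` placeholder, the comment-style prompt wording, the helper name `build_repair_inputs`, and the example function are all hypothetical, and real infilling models use their own mask tokens.

```python
from typing import Dict, List

# Hypothetical placeholder; each real infilling model defines its own mask token.
MASK = "<INFILL>"


def build_repair_inputs(function_lines: List[str], buggy_index: int) -> Dict[str, str]:
    """Build one illustrative model input per repair setting."""
    prefix = function_lines[:buggy_index]        # code before the buggy line
    suffix = function_lines[buggy_index + 1:]    # code after the buggy line
    buggy_function = "\n".join(function_lines)
    return {
        # Whole buggy function in, whole fixed function out (wording is an assumption).
        "complete_function": "# Buggy function\n" + buggy_function + "\n# Fixed function\n",
        # Prefix and suffix are both kept; the model fills in the masked gap.
        "infilling": "\n".join(prefix + [MASK] + suffix),
        # A generative model sees only the prefix and writes the next line.
        "single_line_prefix_only": "\n".join(prefix) + "\n",
    }


buggy = [
    "def total(xs):",
    "    s = 0",
    "    for i in range(len(xs) - 1):",  # buggy line: skips the last element
    "        s += xs[i]",
    "    return s",
]
inputs = build_repair_inputs(buggy, buggy_index=2)
```

Only the infilling input retains the lines after the bug (`s += xs[i]`, `return s`), which is the extra information the results above associate with more fixes and a higher compilation rate.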

This review was generated by AI and reviewed by a human editor.