QUICK REVIEW

[Paper Review] A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair

Quanjun Zhang, Tongke Zhang | arXiv (Cornell University) | October 13, 2023

Software Engineering Research | Citations: 31

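One-Line Summary

This paper builds EvalGPTFix to evaluate ChatGPT's automated program repair ability on previously unseen Java bugs from AtCoder. With the basic prompt, ChatGPT fixes 109 of 151 bugs; with enhanced prompts and dialogue, it fixes up to 143, outperforming CodeT5 and PLBART. The paper also analyzes prompt design, dialogue-based repair, and data leakage concerns for black-box LLMs in SE.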

ABSTRACT

Large Language Models (LLMs) have been gaining increasing attention and demonstrated promising performance across a variety of Software Engineering (SE) tasks, such as Automated Program Repair (APR), code summarization, and code completion. For example, ChatGPT, the latest black-box LLM, has been investigated by numerous recent research studies and has shown impressive performance in various tasks. However, there exists a potential risk of data leakage since these LLMs are usually close-sourced with unknown specific training details, e.g., pre-training datasets. In this paper, we seek to review the bug-fixing capabilities of ChatGPT on a clean APR benchmark with different research objectives. We first introduce EvalGPTFix, a new benchmark with buggy and the corresponding fixed programs from competitive programming problems starting from 2023, after the training cutoff point of ChatGPT. The results on EvalGPTFix show that ChatGPT is able to fix 109 out of 151 buggy programs using the basic prompt within 35 independent rounds, outperforming state-of-the-art LLMs CodeT5 and PLBART by 27.5% and 62.4% prediction accuracy. We also investigate the impact of three types of prompts, i.e., problem description, error feedback, and bug localization, leading to additional 34 fixed bugs. Besides, we provide additional discussion from the interactive nature of ChatGPT to illustrate the capacity of a dialog-based repair workflow with 9 additional fixed bugs. Inspired by the findings, we further pinpoint various challenges and opportunities for advanced SE study equipped with such LLMs (e.g., ChatGPT) in the near future. More importantly, our work calls for more research on the reevaluation of the achievements obtained by existing black-box LLMs across various SE tasks, not limited to ChatGPT on APR.

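Research Motivation and Objectives

Evaluate how effectively ChatGPT fixes bugs on EvalGPTFix, a clean APR benchmark the model has not seen.
Investigate how different prompts (problem description, error information, bug location) affect repair performance.
Explore whether dialogue-based interaction improves iterative bug fixing with ChatGPT.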

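Proposed Method

Build EvalGPTFix: 151 buggy/fixed Java program pairs from 2023 AtCoder contests, validated with test cases and filtered statically and dynamically to ensure the data is unseen.
Use ChatGPT (gpt-3.5-turbo) to repair each bug with repeated prompting for up to 35 rounds, stopping when three consecutive rounds yield no new fixes.
Benchmark state-of-the-art LLMs such as CodeT5 and PLBART by fine-tuning them on FixEval data and evaluating their patches against the AtCoder test suites.
Evaluate prompts that add (a) the problem description, (b) error information, (c) the bug location, and (d) dialogue-based interaction, and measure the additional bugs fixed.
Report recall, fix rates by bug type, and cross-model overlap to assess relative strengths.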
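The round-based procedure above can be sketched as follows. This is a minimal illustration under one reading of the stopping rule (rounds run over the whole benchmark, and the run ends after three consecutive rounds with no newly fixed bug), not the authors' implementation. The names `run_repair_rounds`, `toy_generate_patch`, `toy_passes_tests`, and `TOY_BUGS` are hypothetical; in a real run, the patch generator would query the model and the checker would execute the test suite.

```python
from typing import Callable, Dict, Set, Tuple

MAX_ROUNDS = 35  # round cap stated in the review
PATIENCE = 3     # consecutive rounds without a new fix before stopping


def run_repair_rounds(
    bugs: Dict[str, str],
    generate_patch: Callable[[str, int], str],
    passes_tests: Callable[[str, str], bool],
    max_rounds: int = MAX_ROUNDS,
    patience: int = PATIENCE,
) -> Tuple[Set[str], int]:
    """Return the set of fixed bug ids and the number of rounds executed."""
    fixed: Set[str] = set()
    stale_rounds = 0
    rounds_run = 0
    for round_no in range(1, max_rounds + 1):
        rounds_run = round_no
        newly_fixed: Set[str] = set()
        for bug_id, buggy_code in bugs.items():
            if bug_id in fixed:
                continue  # only still-unfixed bugs are retried
            patch = generate_patch(buggy_code, round_no)
            if passes_tests(bug_id, patch):
                newly_fixed.add(bug_id)
        fixed |= newly_fixed
        stale_rounds = 0 if newly_fixed else stale_rounds + 1
        if stale_rounds >= patience or len(fixed) == len(bugs):
            break
    return fixed, rounds_run


# Hypothetical stand-ins for the model call and the test-suite check.
TOY_BUGS = {"a": "x = 1", "b": "x = 2", "c": "x = 3"}


def toy_generate_patch(buggy_code: str, round_no: int) -> str:
    # A real implementation would send a repair prompt to the model.
    return f"{buggy_code}  # round {round_no}"


def toy_passes_tests(bug_id: str, patch: str) -> bool:
    # Pretend "a" is fixed in round 1, "b" in round 3, and "c" never.
    winning_round = {"a": 1, "b": 3}.get(bug_id)
    return winning_round is not None and patch.endswith(f"# round {winning_round}")


if __name__ == "__main__":
    fixed, rounds = run_repair_rounds(TOY_BUGS, toy_generate_patch, toy_passes_tests)
    print(sorted(fixed), rounds)
```

With the toy stubs, the run stops at round 6: the last new fix arrives in round 3, and rounds 4 to 6 add nothing. Retrying only unfixed bugs keeps the cost of later rounds low, which matters when each round is a paid model call.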

Experimental Results

Research Questions

RQ1: How effective is ChatGPT at fixing buggy programs in EvalGPTFix?
RQ2: How do different prompts affect ChatGPT's repair performance?
RQ3: Can dialogue-based interaction further improve ChatGPT's repair results?

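Key Results

ChatGPT fixed 109 of 151 bugs in EvalGPTFix with the basic prompt.
Adding the problem description, error information, and bug location fixed an additional 18, 25, and 10 bugs, respectively.
Dialogue produced 9 additional fixes beyond the prompt-based attempts.
Overall, ChatGPT fixed 143 bugs in EvalGPTFix, suggesting strong potential for repairing real-world buggy programs.
CodeT5 fixed 79 bugs and PLBART fixed 41, underscoring ChatGPT's stronger repair ability in this study.
ChatGPT's output is markedly random, so multiple rounds (up to 35) are needed to stabilize the results.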
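The counts above can be turned into rates with a few lines of arithmetic. The sketch below uses only the numbers reported in this review; the helper names `fix_rate` and `relative_gap` are hypothetical. One observation, offered as an assumption and not as the paper's stated definition: the abstract's 27.5% and 62.4% margins match the gap in fixed-bug counts expressed relative to ChatGPT's 109 fixes.

```python
TOTAL_BUGS = 151  # size of EvalGPTFix

# Fixed-bug counts reported in the review.
FIXED = {
    "ChatGPT (basic prompt)": 109,
    "ChatGPT (prompts + dialogue)": 143,
    "CodeT5": 79,
    "PLBART": 41,
}


def fix_rate(fixed: int, total: int = TOTAL_BUGS) -> float:
    """Share of the benchmark fixed, in percent."""
    return 100.0 * fixed / total


def relative_gap(reference: int, other: int) -> float:
    """Gap between two counts as a percentage of the reference count."""
    return 100.0 * (reference - other) / reference


if __name__ == "__main__":
    for name, count in FIXED.items():
        print(f"{name}: {count}/{TOTAL_BUGS} = {fix_rate(count):.1f}%")
    base = FIXED["ChatGPT (basic prompt)"]
    for name in ("CodeT5", "PLBART"):
        print(f"ChatGPT vs {name}: {relative_gap(base, FIXED[name]):.1f}%")
```

Under this reading, the basic prompt fixes 72.2% of the benchmark, against 52.3% for CodeT5 and 27.2% for PLBART, and the combined prompt and dialogue setting reaches 94.7%.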


This review was generated by AI and reviewed by a human editor.