QUICK REVIEW

[Paper Review] Large Language Models in Fault Localisation

Yonghao Wu, Zheng Li | arXiv (Cornell University) | 2023-08-29

Software Engineering Research | Citations: 18

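One-line summary

This study evaluates ChatGPT-3.5 and ChatGPT-4 on fault localisation using Defects4J. ChatGPT-4 (Log) achieves the best TOP-1 performance with function-level context, but its effectiveness drops sharply with class-level context. Supplying error logs as auxiliary input improves both accuracy and consistency.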

ABSTRACT

Large Language Models (LLMs) have shown promise in multiple software engineering tasks including code generation, program repair, code summarisation, and test generation. Fault localisation is instrumental in enabling automated debugging and repair of programs and was prominently featured as a highlight during the launch event of ChatGPT-4. Nevertheless, the performance of LLMs compared to state-of-the-art methods, as well as the impact of prompt design and context length on their efficacy, remains unclear. To fill this gap, this paper presents an in-depth investigation into the capability of ChatGPT-3.5 and ChatGPT-4, the two state-of-the-art LLMs, on fault localisation. Using the widely-adopted large-scale Defects4J dataset, we compare the two LLMs with the existing fault localisation techniques. We also investigate the consistency of LLMs in fault localisation, as well as how prompt engineering and the length of code context affect the fault localisation effectiveness. Our findings demonstrate that within function-level context, ChatGPT-4 outperforms all the existing fault localisation methods. Additional error logs can further improve ChatGPT models' localisation accuracy and consistency, with an average 46.9% higher accuracy over the state-of-the-art baseline SmartFL on the Defects4J dataset in terms of TOP-1 metric. However, when the code context of the Defects4J dataset expands to the class-level, ChatGPT-4's performance suffers a significant drop, with 49.9% lower accuracy than SmartFL under TOP-1 metric. These observations indicate that although ChatGPT can effectively localise faults under specific conditions, limitations are evident. Further research is needed to fully harness the potential of LLMs like ChatGPT for practical fault localisation applications.

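Research motivation and goals

Evaluate the fault localisation ability of ChatGPT-3.5 and ChatGPT-4 on real bugs from Defects4J.
Compare ChatGPT-based localisation with state-of-the-art baselines (SBFL, MBFL, SmartFL).
Investigate how prompt design and code context length affect localisation effectiveness.
Assess consistency across repeated runs and test how error logs and context affect localisation.
Extend the evaluation beyond Defects4J to StuDefects, a more recent dataset, to address the concern of overfitting.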

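Proposed method

Evaluate fault localisation techniques on six Java projects from Defects4J.
Compare SBFL (Ochiai, Dstar), MBFL (Metallaxis-based), SmartFL, and four ChatGPT configurations (ChatGPT-3.5 and ChatGPT-4, each with the Origin and Log prompts).
Use function-level code context to stay within the LLMs' token limits, and keep the context given to ChatGPT consistent with that used for the baselines.
Use two prompt versions: Origin (the target function only) and Log (which adds the failing test log and error message within the conversation).
Evaluate with TOP-N (TOP-1 to TOP-5), averaged over five repeated runs; test significance with the Wilcoxon signed-rank test.
Extend the analysis to StuDefects to test generalisability beyond Defects4J.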
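The SBFL baselines and the TOP-N metric named above are simple enough to sketch. What follows is a minimal illustration, not the paper's evaluation harness. Ochiai and DStar use their standard published formulas (DStar with the commonly used exponent 2). The function names, the coverage counts, and the convention that a bug counts toward TOP-N when any of its faulty elements is ranked within the first N positions are assumptions made for this example.

```python
import math
from statistics import mean


def ochiai(ef: int, ep: int, nf: int) -> float:
    """Ochiai suspiciousness: ef / sqrt((ef + nf) * (ef + ep)).

    ef / ep: failing / passing tests that execute the element.
    nf: failing tests that do not execute the element.
    """
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0


def dstar(ef: int, ep: int, nf: int, star: int = 2) -> float:
    """DStar suspiciousness: ef**star / (ep + nf)."""
    denom = ep + nf
    if denom == 0:
        # Executed only by failing tests: treat as maximally suspicious.
        return math.inf if ef > 0 else 0.0
    return ef ** star / denom


def top_n(rankings: dict, faulty: dict, n: int) -> int:
    """Count bugs with at least one faulty element in the first n ranks."""
    return sum(
        1
        for bug_id, ranked in rankings.items()
        if any(elem in faulty[bug_id] for elem in ranked[:n])
    )


def mean_top_n(runs: list, faulty: dict, n: int) -> float:
    """Average TOP-N over repeated runs (the review reports five runs)."""
    return mean(top_n(run, faulty, n) for run in runs)


# Hypothetical spectrum for three lines: (ef, ep, nf) per line.
spectrum = {"line_10": (2, 1, 0), "line_11": (1, 3, 1), "line_12": (0, 4, 2)}
ranked_by_ochiai = sorted(
    spectrum, key=lambda line: ochiai(*spectrum[line]), reverse=True
)
print(ranked_by_ochiai)
```

Averaging over runs matters for the ChatGPT configurations because their answers can differ between repetitions, whereas the SBFL scores are deterministic for a fixed test suite.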

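Experimental results

Research questions

RQ1. How does ChatGPT's fault localisation performance compare with state-of-the-art baselines?
RQ1.1. How do ChatGPT-3.5 and ChatGPT-4 perform on Defects4J under the TOP-N metrics?
RQ1.2. How consistent is ChatGPT's fault localisation across repeated runs?
RQ2. How does prompt design (including or excluding individual components) affect ChatGPT's fault localisation performance?
RQ3. How does the length of the code context in the prompt affect ChatGPT's performance?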
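RQ2 turns on which prompt components are present. As a rough sketch of the two variants this review describes, the Origin prompt carries only the target function, while the Log prompt also carries the failing test log and error message. The prompt wording, the function name `build_prompt`, and the sample Java snippet and log line are all hypothetical; the paper's actual prompt text is not reproduced in this review, and the sketch folds everything into a single string for simplicity rather than modelling a multi-turn conversation.

```python
from typing import Optional


def build_prompt(function_source: str, failure_log: Optional[str] = None) -> str:
    """Build an Origin-style prompt, or a Log-style one if a log is given."""
    parts = [
        "The following Java function contains a fault.",
        "List the five most suspicious lines, most suspicious first.",
        "",
        "Function:",
        function_source.strip(),
    ]
    if failure_log:
        # Log variant: append the failing test output after the code.
        parts += ["", "Failing test log and error message:", failure_log.strip()]
    return "\n".join(parts)


# Hypothetical inputs, for illustration only.
java_function = "int half(int x) { return x / 3; }"
log_line = "AssertionError: expected:<4> but was:<2>"

origin_prompt = build_prompt(java_function)
log_prompt = build_prompt(java_function, log_line)
print(log_prompt)
```

Keeping the Log prompt a strict extension of the Origin prompt makes an ablation straightforward: any difference in results can be attributed to the added log text.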

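Key results

ChatGPT-4 (Log) achieves the highest average TOP-1 on Defects4J at 23.13, outperforming SmartFL by 46.9%.
ChatGPT-4 (Log) shows higher TOP-N values than the other configurations from TOP-1 through TOP-5 on Defects4J.
ChatGPT-4 (Log) has the largest overlap with the other methods and also identifies many unique faults that no baseline captures.
Expanding the code context from function level to class level sharply reduces ChatGPT-4's effectiveness, leaving its TOP-1 49.9% lower than SmartFL's.
Including error logs in the prompt is the most influential component; omitting them reduces accuracy by 25.6%.
The extended StuDefects evaluation is consistent with the Defects4J results, with ChatGPT-4 (Log) achieving an advantage of about 52.9% over the baseline at TOP-1.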
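The percentages above read most naturally as relative differences against a reference TOP-1 value (a baseline such as SmartFL, or the full prompt in the ablation). A minimal sketch of that arithmetic follows. The function name and the numbers passed to it are made up for illustration and are not figures from the paper.

```python
def relative_change(value: float, baseline: float) -> float:
    """Percentage change of value relative to baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


# Hypothetical TOP-1 counts, chosen only to show the sign convention.
higher = relative_change(30.0, 20.0)  # method beats the baseline
lower = relative_change(10.0, 20.0)   # method trails the baseline
print(higher, lower)
```

A positive result means the method localises more faults at TOP-1 than the reference; a negative result means fewer.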

This review was generated by AI and checked by a human editor.