QUICK REVIEW

[Paper Review] Boosting LLMs for Mutation Generation

Bo Wang, Ming Deng | arXiv (Cornell University) | 2026-03-25

Software Testing and Debugging Techniques | Citations: 0

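One-Line Summary

SMART combines retrieval-augmented generation (RAG), code chunking, and supervised fine-tuning to improve the quality of LLM-based mutation generation. It achieves higher mutation validity and effectiveness and lets 7B-scale models match or even surpass large models such as GPT-4o.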

ABSTRACT

LLM-based mutation testing is a promising testing technology, but existing approaches typically rely on a fixed set of mutations as few-shot examples or none at all. This can result in generic low-quality mutations, missed context-specific mutation patterns, substantial numbers of redundant and uncompilable mutants, and limited semantic similarity to real bugs. To overcome these limitations, we introduce SMART (Semantic Mutation with Adaptive Retrieval and Tuning). SMART integrates retrieval-augmented generation (RAG) on a vectorized dataset of real-world bugs, focused code chunking, and supervised fine-tuning using mutations coupled with real-world bugs. We conducted an extensive empirical study of SMART using 1,991 real-world Java bugs from the Defects4J and ConDefects datasets, comparing SMART to the state-of-the-art LLM-based approaches, LLMut and LLMorpheus. The results reveal that SMART substantially improves mutation validity, effectiveness, and efficiency (even enabling small-scale 7B-scale models to match or even surpass large models like GPT-4o). We also demonstrate that SMART significantly improves downstream software engineering applications, including test case prioritization and fault localization. More specifically, SMART improves validity (weighted average generation rate) from 42.89% to 65.6%. It raises the non-duplicate rate from 87.38% to 95.62%, and the compilable rate from 88.85% to 90.21%. In terms of effectiveness, it achieves a real bug detection rate of 92.61% (vs. 57.86% for LLMut) and improves the average Ochiai coefficient from 25.61% to 38.44%. For fault localization, SMART ranks 64 more bugs as Top-1 under MUSE and 57 more under Metallaxis.

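Research Motivation and Goals

Improve mutation generation quality so that mutants reflect real-world bugs.
Develop context-aware mutation generation by leveraging real-world bug data.
Reduce invalid, duplicate, and uncompilable mutants while increasing semantic similarity to real bugs.
Enable smaller models to perform competitively with larger LLMs.
Demonstrate downstream benefits for test case prioritization and fault localization.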

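Proposed Method

Build a retrieval-augmented generation (RAG) pipeline over a vectorized bug-fix dataset of 130,000 Java bugs.
Apply logic-based code chunking to split the focal method into semantically coherent chunks.
Design task-specific prompts and context integration for LLM-driven mutation generation.
Fine-tune the LLM with supervised learning on 13,760 mutations paired with real-world bugs.
Evaluate on 1,991 real Java bugs from Defects4J and ConDefects using several models, including 7B-scale models and GPT-4o.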
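
The retrieval step of the method above can be illustrated with a small sketch. It is not the authors' implementation: a toy token-count vector stands in for a real embedding model (the review does not name the one SMART uses), and the function names `embed`, `retrieve`, `build_prompt` and the three bug-fix pairs in `CORPUS` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical bug-fix pairs; the real dataset holds about 130,000 Java bugs.
CORPUS = [
    {"fixed": "if (i < n) return a[i];", "buggy": "if (i <= n) return a[i];"},
    {"fixed": "return x + y;", "buggy": "return x - y;"},
    {"fixed": "if (s != null) s.close();", "buggy": "s.close();"},
]

def embed(code: str) -> Counter:
    """Toy stand-in for an embedding model: counts of identifiers, numbers, symbols."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(chunk: str, corpus: list[dict], k: int = 2) -> list[dict]:
    """Return the k pairs whose fixed (correct) code is most similar to the chunk."""
    query = embed(chunk)
    ranked = sorted(
        corpus, key=lambda pair: cosine(query, embed(pair["fixed"])), reverse=True
    )
    return ranked[:k]

def build_prompt(chunk: str, examples: list[dict]) -> str:
    """Use the retrieved pairs as few-shot examples ahead of the chunk to mutate."""
    shots = "\n".join(
        f"// original\n{ex['fixed']}\n// mutated\n{ex['buggy']}" for ex in examples
    )
    return f"{shots}\n// original\n{chunk}\n// mutated\n"

chunk = "if (idx < len) return buf[idx];"  # one chunk of a focal method
examples = retrieve(chunk, CORPUS, k=1)
prompt = build_prompt(chunk, examples)
print(prompt)
```

With these toy inputs the boundary-check pair is retrieved first, so the few-shot example shown to the model is a bug of the same kind as the code being mutated, rather than a fixed, generic example.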

Figure 1. The Overview of Mutation Generation Process of SMART

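Experimental Results

Research Questions

RQ1: Does SMART generate more valid mutations than existing approaches?
RQ2: Do SMART's mutations resemble real bugs more closely than the baselines' mutations?
RQ3: How does SMART affect mutation-based test case prioritization?
RQ4: How does SMART affect mutation-based fault localization?
RQ5: How much does each SMART component (RAG, chunking, fine-tuning) contribute, as measured by ablation?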

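Key Results

Validity: the weighted average generation rate rises from 42.89% (LLMut) to 65.6%.
The non-duplicate rate rises from 87.38% (LLMut) and 85.87% (LLMorpheus) to 95.62%.
The compilable rate improves from 88.85% (LLMut) and 78.43% (LLMorpheus) to 90.21%.
Effectiveness: the real bug detection rate reaches 92.61%, versus 57.86% for LLMut and 31.99% for LLMorpheus.
The average Ochiai coefficient rises from 25.61% to 38.44%.
Downstream gains: in fault localization, SMART ranks 64 more bugs as Top-1 under MUSE and 57 more under Metallaxis, and 7B-scale models match or even surpass large models such as GPT-4o.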
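
The review reports the Ochiai coefficient without defining it. In the mutation-testing literature it is commonly computed between a mutant and a real bug from the sets of tests each one causes to fail, as the size of the overlap divided by the square root of the product of the two set sizes. The sketch below assumes the paper follows that common definition; the function name and the test identifiers are hypothetical.

```python
import math

def ochiai(mutant_failing: set[str], bug_failing: set[str]) -> float:
    """Ochiai similarity between the failing-test sets of a mutant and a real bug."""
    if not mutant_failing or not bug_failing:
        return 0.0  # nothing fails on one side, so there is no overlap to measure
    shared = len(mutant_failing & bug_failing)
    return shared / math.sqrt(len(mutant_failing) * len(bug_failing))

# Hypothetical example: the real bug fails t1 and t2, the mutant fails t2 and t3.
score = ochiai({"t2", "t3"}, {"t1", "t2"})
print(score)  # 0.5
```

A score of 1.0 means the mutant fails exactly the same tests as the real bug, and 0.0 means the two share no failing test, so a higher average indicates mutants that behave more like real faults under the test suite.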

Figure 2. The Example Mutation of SMART

This review was generated by AI and reviewed by a human editor.