QUICK REVIEW

[Paper Review] Boosting LLMs for Mutation Generation

Bo Wang, Ming Deng | arXiv (Cornell University) | Mar 25, 2026

Software Testing and Debugging Techniques | Citations: 0

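One-Sentence Summary

SMART improves LLM-based mutation generation by combining retrieval-augmented generation, code chunking, and supervised fine-tuning; it raises mutation validity and effectiveness and lets small models perform close to GPT-4o.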

ABSTRACT

LLM-based mutation testing is a promising testing technology, but existing approaches typically rely on a fixed set of mutations as few-shot examples or none at all. This can result in generic low-quality mutations, missed context-specific mutation patterns, substantial numbers of redundant and uncompilable mutants, and limited semantic similarity to real bugs. To overcome these limitations, we introduce SMART (Semantic Mutation with Adaptive Retrieval and Tuning). SMART integrates retrieval-augmented generation (RAG) on a vectorized dataset of real-world bugs, focused code chunking, and supervised fine-tuning using mutations coupled with real-world bugs. We conducted an extensive empirical study of SMART using 1,991 real-world Java bugs from the Defects4J and ConDefects datasets, comparing SMART to the state-of-the-art LLM-based approaches, LLMut and LLMorpheus. The results reveal that SMART substantially improves mutation validity, effectiveness, and efficiency (even enabling small-scale 7B-scale models to match or even surpass large models like GPT-4o). We also demonstrate that SMART significantly improves downstream software engineering applications, including test case prioritization and fault localization. More specifically, SMART improves validity (weighted average generation rate) from 42.89% to 65.6%. It raises the non-duplicate rate from 87.38% to 95.62%, and the compilable rate from 88.85% to 90.21%. In terms of effectiveness, it achieves a real bug detection rate of 92.61% (vs. 57.86% for LLMut) and improves the average Ochiai coefficient from 25.61% to 38.44%. For fault localization, SMART ranks 64 more bugs as Top-1 under MUSE and 57 more under Metallaxis.

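Motivation and Objectives

Improve the quality of generated mutations so that they better reflect real-world bugs.
Develop context-aware mutation generation that draws on real-world bug data.
Reduce invalid, duplicate, and uncompilable mutants while increasing semantic relevance.
Enable small-scale models to perform competitively with large LLMs.
Demonstrate downstream benefits for test case prioritization and fault localization.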

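Proposed Method

Build a retrieval-augmented generation (RAG) pipeline over a vectorized dataset of 130,000 Java bugs.
Apply logic-based code chunking to decompose the focal method into semantically coherent chunks.
Design task-specific prompts and context integration for LLM-driven mutation generation.
Fine-tune the LLM with supervised learning on 13,760 mutations coupled with real-world bugs.
Evaluate on 1,991 real Java bugs from Defects4J and ConDefects using multiple models, including 7B-scale models and GPT-4o.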
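
The retrieval step of the method can be illustrated with a minimal Python sketch. It assumes a toy in-memory corpus of (fixed, buggy) pairs and uses a bag-of-tokens cosine similarity in place of a learned embedding model and vector store. `BUG_CORPUS`, `retrieve`, and `build_prompt` are hypothetical names, and the prompt wording is not taken from the paper.

```python
import math
import re
from collections import Counter

# Toy stand-in for the vectorized real-bug dataset (hypothetical entries).
BUG_CORPUS = [
    {"fixed": "if (i < n) { return a[i]; }", "buggy": "if (i <= n) { return a[i]; }"},
    {"fixed": "return x == null ? 0 : x.length();", "buggy": "return x.length();"},
    {"fixed": "total += price * qty;", "buggy": "total += price + qty;"},
]


def embed(code):
    """Bag-of-tokens vector: identifiers, numbers, and single symbols."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|\S", code))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(chunk, corpus, k=2):
    """Return the k bugs whose fixed code is most similar to the chunk."""
    query = embed(chunk)
    ranked = sorted(
        corpus,
        key=lambda bug: cosine(query, embed(bug["fixed"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(chunk, examples):
    """Use retrieved bugs as few-shot examples ahead of the target chunk."""
    shots = "\n".join(
        f"Original: {e['fixed']}\nMutant: {e['buggy']}" for e in examples
    )
    return f"{shots}\nOriginal: {chunk}\nMutant:"
```

Calling `retrieve("if (index < size) { return items[index]; }", BUG_CORPUS, k=1)` ranks the bounds-check bug first in this toy corpus, so the few-shot example shown to the model is an off-by-one mutation and not a fixed, generic one.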

Figure 1. Overview of the Mutation Generation Process of SMART
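
The code-chunking step in this pipeline might look roughly like the following Python sketch, which splits a Java method body into top-level statements and blocks by tracking brace and parenthesis depth. This is an assumption about what "logic-based" chunking could mean, not the paper's algorithm. `split_top_level` is a hypothetical helper, and it deliberately ignores string literals, comments, array initializers, and `do`/`while` loops.

```python
import re

# Keywords that continue the previous block instead of starting a new chunk.
CONTINUATION = re.compile(r"(else|catch|finally)\b")


def split_top_level(body):
    """Split a Java method body into top-level statements and blocks."""
    chunks, current = [], []
    braces = parens = 0
    for ch in body:
        current.append(ch)
        if ch == "(":
            parens += 1
        elif ch == ")":
            parens -= 1
        elif ch == "{":
            braces += 1
        elif ch == "}":
            braces -= 1
        # A chunk ends at a top-level ';' or at the '}' closing a top-level block.
        at_top = braces == 0 and parens == 0
        if at_top and ch in ";}":
            chunks.append("".join(current).strip())
            current = []
    tail = "".join(current).strip()
    if tail:
        chunks.append(tail)

    # Re-attach else/catch/finally branches to the block they belong to.
    merged = []
    for chunk in chunks:
        if merged and CONTINUATION.match(chunk):
            merged[-1] += " " + chunk
        else:
            merged.append(chunk)
    return merged


EXAMPLE_BODY = """
int sum = 0;
for (int i = 0; i < n; i++) {
    sum += a[i];
}
if (sum > limit) {
    return limit;
} else {
    return sum;
}
"""
```

On `EXAMPLE_BODY` this yields three chunks: the declaration, the whole `for` loop, and the `if`/`else` statement as a single unit. Each chunk could then be mutated separately. A production version would more likely work on a parsed syntax tree than on raw characters.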

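Experimental Results

Research Questions

RQ1: Does SMART generate more valid mutations than existing approaches?
RQ2: Are SMART's mutations closer to real bugs than those of the baselines?
RQ3: How does SMART affect the performance of mutation-based test case prioritization?
RQ4: How does SMART affect the performance of mutation-based fault localization?
RQ5: How much does each SMART component (RAG, chunking, fine-tuning) contribute, as measured by ablation?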

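Key Findings

Higher validity: the weighted average generation rate rises from 42.89% (LLMut) to 65.6%.
The non-duplicate rate rises to 95.62%, from 87.38% for LLMut and 85.87% for LLMorpheus.
Higher compilability: the compilable rate reaches 90.21%, from 88.85% for LLMut and 78.43% for LLMorpheus.
Effectiveness: the real bug detection rate reaches 92.61%, versus 57.86% for LLMut and 31.99% for LLMorpheus.
The average Ochiai coefficient (AOC) rises from 25.61% to 38.44%.
Downstream gains: SMART ranks 64 more bugs as Top-1 under MUSE and 57 more under Metallaxis, and 7B-scale models match GPT-4o.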
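
For readers unfamiliar with the Ochiai coefficient, the sketch below shows the definition commonly used to compare a mutant with a real bug: the number of tests that fail on both, divided by the geometric mean of the two failing-test counts. Whether SMART's evaluation computes it in exactly this way is an assumption here, and the function and test names are hypothetical.

```python
import math


def ochiai(mutant_failing, bug_failing):
    """Ochiai similarity between two sets of failing test names.

    Returns a value in [0, 1]; 1 means the mutant and the real bug
    fail exactly the same tests, 0 means they share no failing test.
    """
    if not mutant_failing or not bug_failing:
        return 0.0
    shared = len(mutant_failing & bug_failing)
    return shared / math.sqrt(len(mutant_failing) * len(bug_failing))


def average_ochiai(pairs):
    """Mean Ochiai value over (mutant_failing, bug_failing) pairs."""
    if not pairs:
        return 0.0
    return sum(ochiai(m, b) for m, b in pairs) / len(pairs)
```

For example, a mutant that fails `{"t1", "t2"}` compared with a bug that fails `{"t2", "t3"}` shares one test out of two on each side, giving 1 / sqrt(2 * 2) = 0.5. A higher average therefore indicates mutants whose failing-test behavior is closer to that of real bugs.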

Figure 2. Example Mutation Generated by SMART

This review was generated by AI and reviewed by a human editor.