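[Paper Review] A Differential Fuzzing-Based Evaluation of Functional Equivalence in LLM-Generated Code Refactorings
The paper uses differential fuzzing (Eq@DFuzz) to evaluate the functional equivalence of LLM-generated code refactorings across six models, three datasets, and two refactoring types, revealing substantial non-equivalence and gaps in test-suite-based evaluation.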
With the rapid adoption of large language models (LLMs) in automated code refactoring, assessing and ensuring functional equivalence between LLM-generated refactoring and the original implementation becomes critical. While prior work typically relies on predefined test cases to evaluate correctness, in this work, we leverage differential fuzzing to check functional equivalence in LLM-generated code refactorings. Unlike test-based evaluation, a differential fuzzing-based equivalence checker needs no predefined test cases and can explore a much larger input space by executing and comparing thousands of automatically generated test inputs. In a large-scale evaluation of six LLMs (CodeLlama, Codestral, StarChat2, Qwen-2.5, Olmo-3, and GPT-4o) across three datasets and two refactoring types, we find that LLMs show a non-trivial tendency to alter program semantics, producing 19-35% functionally non-equivalent refactorings. Our experiments further demonstrate that about 21% of these non-equivalent refactorings remain undetected by the existing test suites of the three evaluated datasets. Collectively, the findings of this study imply that reliance on existing tests might overestimate functional equivalence in LLM-generated code refactorings, which remain prone to semantic divergence.
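Motivation and Goals
- Advance reliable evaluation of LLM-driven code refactoring beyond test-pass metrics.
- Assess the functional equivalence of refactorings produced by multiple LLMs across different datasets.
- Show that conventional test suites can miss a substantial share of semantic differences.
Proposed Method
- Generate refactorings with six LLMs (CodeLlama, Codestral, StarChat2, Qwen-2.5, Olmo-3, GPT-4o).
- Apply two prompts (performance optimization and code simplification), yielding 4,368 refactorings across three datasets.
- Evaluate functional equivalence with Eq@DFuzz, a differential fuzzing checker that generates 1,000–2,000 test inputs per refactoring.
- Compare the Eq@DFuzz results against conventional test-suite correctness (Corr@Test).
- Analyze equivalence across datasets (HumanEval, MBPP, APPS) and refactoring types (simplification, optimization).
- Report the non-equivalent refactorings and the gaps in the test suites.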
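The core idea behind a differential fuzzing equivalence check can be sketched in a few lines. This is a minimal illustration and not the paper's Eq@DFuzz implementation: the names `differential_fuzz`, `gen_int_list`, `mean_floor`, and `mean_floor_refactored` are hypothetical, and the input generator is a simple random sampler standing in for a real fuzzer.

```python
import copy
import random


def run(func, args):
    """Run func on a private copy of args; capture the result or the exception type."""
    try:
        return ("ok", func(*copy.deepcopy(args)))
    except Exception as exc:
        return ("error", type(exc).__name__)


def differential_fuzz(original, refactored, gen_input, n_inputs=1000, seed=0):
    """Return the first input on which the two functions disagree, else None."""
    rng = random.Random(seed)
    for _ in range(n_inputs):
        args = gen_input(rng)
        if run(original, args) != run(refactored, args):
            return args  # counterexample: observable behavior diverges
    return None  # no divergence observed (this is not a proof of equivalence)


def gen_int_list(rng):
    """Hypothetical generator: one argument, a short list of small integers."""
    length = rng.randint(0, 5)
    return ([rng.randint(-50, 50) for _ in range(length)],)


def mean_floor(xs):
    """Original: floor of the mean, 0 for an empty list."""
    return sum(xs) // len(xs) if xs else 0


def mean_floor_refactored(xs):
    """Hypothetical refactoring: int() truncates toward zero, so negatives differ."""
    return int(sum(xs) / len(xs)) if xs else 0


counterexample = differential_fuzz(mean_floor, mean_floor_refactored, gen_int_list)
print("diverging input:", counterexample)
```

A test suite that only uses non-negative inputs would pass both versions here, while random inputs that include negative sums expose the difference. Finding no counterexample only means no divergence was observed within the sampled inputs.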

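Experimental Results
Research Questions
- RQ1: According to differential fuzzing, what proportion of LLM-generated code refactorings are functionally equivalent to the original code?
- RQ2: Can existing test suites reliably detect non-equivalence, or is there a gap relative to Eq@DFuzz?
- RQ3: How do equivalence rates vary across datasets and refactoring types?
- RQ4: Does the complexity of a refactoring affect the likelihood of semantic divergence?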
Key Findings
| Model | Refactoring | HumanEval | MBPP | APPS | Overall (per model) |
|---|---|---|---|---|---|
| CodeLlama | Simplification | 33.33% | 24.24% | 26.55% | 26.23% |
| CodeLlama | Optimization | 30.95% | 23.19% | 15.93% | |
| Codestral | Simplification | 23.81% | 36.07% | 40.35% | 35.14% |
| Codestral | Optimization | 27.12% | 50.85% | 42.11% | |
| StarChat2 | Simplification | 26.23% | 33.33% | 45.54% | 34.24% |
| StarChat2 | Optimization | 32.28% | 32.20% | 35.40% | |
| Qwen-2.5 | Simplification | 13.18% | 27.14% | 30.09% | 22.01% |
| Qwen-2.5 | Optimization | 18.32% | 18.18% | 27.52% | |
| Olmo-3 | Simplification | 18.55% | 12.70% | 43.88% | 21.73% |
| Olmo-3 | Optimization | 14.40% | 8.96% | 28.09% | |
| GPT-4o | Simplification | 8.53% | 15.71% | 20.18% | 18.58% |
| GPT-4o | Optimization | 19.69% | 27.42% | 28.57% | |
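- Across models, datasets, and refactoring types, LLMs produce a substantial share of non-equivalent refactorings (19-35%).
- APPS shows the highest non-equivalence rate (32.09%), compared with MBPP (25.33%) and HumanEval (22.10%).
- Simplification and optimization show similar non-equivalence rates (about 26%).
- About 21% of the non-equivalent refactorings pass all existing tests (Corr@Test = 1) yet are non-equivalent under Eq@DFuzz.
- Relying on test suites may overestimate the functional equivalence of LLM-generated refactorings.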
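The two headline quantities, the non-equivalence rate and the share of non-equivalent refactorings that the test suite fails to catch, can be computed from per-refactoring verdicts as sketched below. The `Result` record, the function names, and the toy data are assumptions made for illustration; they are not the paper's data or tooling.

```python
from dataclasses import dataclass


@dataclass
class Result:
    corr_test: bool  # True if the refactoring passes every existing test
    eq_dfuzz: bool   # True if differential fuzzing found no divergence


def non_equivalence_rate(results):
    """Fraction of refactorings flagged as non-equivalent by fuzzing."""
    return sum(not r.eq_dfuzz for r in results) / len(results)


def undetected_share(results):
    """Among non-equivalent refactorings, the fraction that still pass all tests."""
    non_eq = [r for r in results if not r.eq_dfuzz]
    if not non_eq:
        return 0.0
    return sum(r.corr_test for r in non_eq) / len(non_eq)


# Toy verdicts (hypothetical): 6 equivalent, 3 caught by tests, 1 missed by tests.
toy = (
    [Result(corr_test=True, eq_dfuzz=True)] * 6
    + [Result(corr_test=False, eq_dfuzz=False)] * 3
    + [Result(corr_test=True, eq_dfuzz=False)]
)

print(f"non-equivalence rate: {non_equivalence_rate(toy):.0%}")
print(f"undetected by tests:  {undetected_share(toy):.0%}")
```

The second metric is conditioned on non-equivalence, so it measures how often a passing test suite gives false assurance rather than how often tests fail overall.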
This summary was generated by AI and reviewed by a human editor.