QUICK REVIEW

[Paper Review] Reasoning Distillation for Lightweight Automated Program Repair

Aanand K. Balasubramanian, Sashank Silwal | arXiv (Cornell University) | Jan 16, 2026

Software Testing and Debugging Techniques | Cited by 0

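One-Sentence Summary

The paper shows that lightweight symbolic reasoning supervision distilled from a large teacher model improves fix-type classification in a compact CodeT5-based student model without increasing model size, with larger gains on rare bug categories.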

ABSTRACT

We study whether lightweight symbolic reasoning supervision can improve fix type classification in compact automated program repair models. Small code models are attractive for resource-constrained settings, but they typically produce only a single prediction, making it unclear whether they learn meaningful program structure or rely on shallow correlations. We propose a reasoning distillation approach in which a large teacher model provides structured symbolic reasoning tags alongside fix-type labels. These tags capture high-level causal properties of bugs without relying on free-form explanations. We train a CodeT5-based student model under label-only and reasoning-distilled settings on the IntroClass benchmark. Reasoning supervision consistently improves macro averaged performance, particularly on less frequent bug categories, without increasing model size or complexity. We further analyze the relationship between reasoning accuracy and fix-type prediction, showing that correct reasoning traces strongly correlate with correct predictions, while not fully determining them. Our results suggest that symbolic reasoning distillation is a practical way to improve interpretability and robustness in lightweight program repair models.

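Motivation and Objectives

Motivate lightweight debugging tools for resource-constrained settings.
Investigate whether distilled symbolic reasoning supervision can improve fix-type prediction in compact models.
Evaluate the effect of reasoning supervision on accuracy, macro F1, and reasoning quality.
Assess whether reasoning traces correlate with improved bug-type classification.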

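Proposed Method

Use a large teacher model to generate fix-type labels and compact symbolic reasoning tags.
Train a CodeT5-based student model in two settings: label-only training versus joint prediction of fix type and reasoning tags.
Evaluate on the IntroClass dataset with a fixed train/validation split.
Compare accuracy and macro F1 for fix-type prediction, and measure how faithfully the student's reasoning traces reproduce the teacher's.
Analyze per-fix-type gains and how fix-type accuracy depends on whether the reasoning is correct.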
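The two training settings differ only in what the student is asked to emit. One way this could look for a text-to-text student is sketched below. The serialization format, the `Example` class, and the tag names are assumptions for illustration, since this review does not specify how the paper encodes its targets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One training item: buggy code plus teacher-provided supervision."""
    buggy_code: str
    fix_type: str
    reasoning_tags: List[str]


def build_target(example: Example, with_reasoning: bool) -> str:
    """Build the output string the student is trained to generate.

    with_reasoning=False -> label-only setting.
    with_reasoning=True  -> reasoning-distilled setting (label plus tags).
    The "fix_type: ... | reasoning: ..." layout is a hypothetical format.
    """
    label_part = f"fix_type: {example.fix_type}"
    if not with_reasoning:
        return label_part
    # Sort tags so the same tag set always yields the same target string.
    tags = " ".join(sorted(example.reasoning_tags))
    return f"{label_part} | reasoning: {tags}"


example = Example(
    buggy_code="for (i = 0; i <= n; i++) sum += a[i];",  # illustrative snippet
    fix_type="LOOP_BOUND",
    reasoning_tags=["off_by_one", "array_index"],  # hypothetical tag names
)

print(build_target(example, with_reasoning=False))
print(build_target(example, with_reasoning=True))
```

Because both settings share the same input and the same model, any difference in fix-type performance can be attributed to the extra supervision rather than to added capacity.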

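Experimental Results

Research Questions

RQ1: Can lightweight symbolic reasoning supervision distilled from a large teacher model improve fix-type classification in compact automated program repair models without increasing model size or complexity?
RQ2: Does joint supervision with fix-type labels and structured symbolic reasoning tags outperform label-only training, and how does reasoning accuracy relate to fix-type prediction?
RQ3: To what extent can a small model reproduce teacher-generated symbolic reasoning traces, and how does this relate to downstream bug classification?
RQ4: Are the gains concentrated in less frequent or more complex bug categories, and what are the limitations of distilled reasoning in this setting?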
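RQ2 and RQ3 turn on two measurements: how often the student's predicted tags match the teacher's exactly, and how fix-type accuracy changes when the reasoning is right versus wrong. A minimal sketch of both follows, assuming reasoning is represented as a set of tags per example. The function names, tag names, and toy data are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Set, Tuple


def exact_match_rate(pred_tags: List[Set[str]], gold_tags: List[Set[str]]) -> float:
    """Fraction of examples whose predicted tag set equals the teacher's."""
    matches = sum(p == g for p, g in zip(pred_tags, gold_tags))
    return matches / len(gold_tags)


def conditional_accuracy(
    fix_correct: List[bool], reasoning_correct: List[bool]
) -> Tuple[Optional[float], Optional[float]]:
    """Fix-type accuracy given correct reasoning, and given incorrect reasoning.

    Returns None for a condition that has no examples.
    """
    def acc(flag: bool) -> Optional[float]:
        subset = [f for f, r in zip(fix_correct, reasoning_correct) if r == flag]
        return sum(subset) / len(subset) if subset else None

    return acc(True), acc(False)


# Toy data: four examples with invented tags.
pred_tags = [{"a"}, {"a", "b"}, {"c"}, {"b"}]
gold_tags = [{"a"}, {"a", "b"}, {"b"}, {"b"}]
reasoning_correct = [p == g for p, g in zip(pred_tags, gold_tags)]
fix_correct = [True, False, False, True]

print(exact_match_rate(pred_tags, gold_tags))
print(conditional_accuracy(fix_correct, reasoning_correct))
```

In the toy data the second example has correct reasoning but a wrong fix type, which is the kind of case that keeps the conditional accuracy below 1.0 even when the reasoning is right.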

Key Findings

Model	Accuracy	Macro F1
Student (label-only)	0.491	0.213
Student (reasoning-distilled)	0.544	0.249
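To put the table in proportion, the absolute and relative gains can be derived directly from the four reported numbers. The helper name `relative_gain` is introduced here for illustration only.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Improvement expressed as a fraction of the baseline score."""
    return (improved - baseline) / baseline


# (label-only, reasoning-distilled) scores copied from the table.
results = {
    "Accuracy": (0.491, 0.544),
    "Macro F1": (0.213, 0.249),
}

for metric, (base, distilled) in results.items():
    abs_gain = distilled - base
    rel = relative_gain(base, distilled)
    print(f"{metric}: +{abs_gain:.3f} absolute, {rel:.1%} relative")
```

The relative gain comes out at roughly 11% for accuracy and roughly 17% for macro F1, so the improvement is proportionally larger on the metric that weights all classes equally.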

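The reasoning-distilled student outperforms the label-only baseline on fix-type accuracy (0.544 vs. 0.491) and macro F1 (0.249 vs. 0.213).
Reasoning supervision yields a stronger macro-averaged improvement, which benefits less frequent bug categories.
The student reproduces reasoning traces with high fidelity: reasoning macro F1 is 0.545 and exact match is 0.789, and most major tags exceed 0.87 accuracy.
Per-fix-type gains are largest for WRONG_CONDITION, LOOP_BOUND, WRONG_OPERATOR, and MISSING_CASE, with scores well above the baseline.
In some cases correct reasoning does not yield the correct fix type, suggesting that reasoning supervision improves internal representations but does not fully resolve classification ambiguity.
An auxiliary JSON-based distillation study suggests that JSON supervision is more expressive but harder for small models to learn in low-data settings.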
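The link between macro-averaged gains and rare categories follows from how macro F1 is computed: every class contributes equally, regardless of how many examples it has. The sketch below illustrates this on an invented, imbalanced two-class dataset. The class names `COMMON` and `RARE` and the predictions are toy assumptions, not data from the paper.

```python
from typing import Dict, List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of examples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """F1 for each label, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy dataset: 8 examples of a common class, 2 of a rare class.
y_true = ["COMMON"] * 8 + ["RARE"] * 2
majority = ["COMMON"] * 10                               # ignores the rare class
balanced = ["COMMON"] * 7 + ["RARE"] + ["RARE", "COMMON"]  # gets one rare case right

for name, y_pred in [("majority", majority), ("balanced", balanced)]:
    print(name, f"acc={accuracy(y_true, y_pred):.3f}", f"macro_f1={macro_f1(y_true, y_pred):.3f}")
```

Both toy predictors have the same accuracy, but the one that recovers a rare-class example scores higher on macro F1. This is why a macro F1 improvement is read as evidence of better handling of less frequent bug categories.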

This review was generated by AI and reviewed by human editors.