QUICK REVIEW

[Paper Review] Improving Code Generation via Small Language Model-as-a-judge

Giuseppe Crupi, Rosalia Tufano | arXiv (Cornell University) | Feb 12, 2026

Software Engineering Research | Cited by 0

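One-Sentence Summary

The paper studies fine-tuning small language models (SLMs) to judge code correctness and using them as judges to select among solutions generated by multiple SLMs, achieving competitive code generation performance at a cost far below that of large LLMs.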

ABSTRACT

Large language models (LLMs) have shown remarkable capabilities in automated code generation. While effective for mainstream languages, they may underperform on less common or domain-specific languages, prompting companies to develop in-house code generators. While open-source models can be trained for this, only LLMs with tens of billions of parameters match the performance of commercial tools, demanding costly training and deployment. Recent work proposed supporting code generation with smaller models (SLMs) by generating multiple candidate solutions and using another SLM to select the most likely correct one. The most recent work in this area is the one by Sun et al. [29] presenting RankEF, a T5 model trained to rank code solutions using both execution-based and non-execution-based information. However, Sun et al. do not assess the T5 ranker's classification accuracy, that is, how often it misjudges correct implementations as incorrect or vice versa, leaving open questions about the reliability of LMs as code correctness judges for other tasks (e.g., automated code review). Moreover, their experiments involve relatively old models, making it unclear the extent to which such a methodology would still help companies in cheaply training their own code generators with performance comparable to those of massive LLMs. We present a study addressing these limitations. We train several state-of-the-art SLMs as code correctness judges and assess their ability to discriminate between correct and wrong implementations. We show that modern SLMs outperform RankEF, even without exploiting execution-based information. When used as code rankers, they achieve higher performance gains than RankEF and perform competitively with LLMs 5-25x larger, at a fraction of the cost.

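Motivation and Objectives

Motivate the need for cost-effective in-house code generation tools for DSLs and less commonly used languages.
Assess whether fine-tuned SLMs can reliably judge code correctness independently of the generation task.
Assess whether SLMs acting as judges can improve code generation performance by selecting among multiple candidate solutions.
Compare the SLM-based approach with RankEF and large LLM baselines in terms of performance and deployment cost.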

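Proposed Method

Fine-tune four state-of-the-art SLMs (Qwen2.5 Coder 0.5B/3B, Gemma-3 4B, Llama-3.2 3B) as code correctness judges and compare them with GPT-4.1-mini and RankEF.
Assemble training data from 722 code generation tasks drawn from the Java HumanEval/MBPP (Java) and CoderEval benchmarks, with candidate implementations labeled as correct or incorrect by test execution.
Generate 10 candidate solutions per task from each of five code generators, yielding 50 candidates per task for the judges to assess.
Set up the judges in four configurations: zero-shot, few-shot, and fine-tuned with or without execution feedback; evaluate them with F1 and Cohen's Kappa against the test results.
Use the best-performing SLM judge to select the best solution from the pool of SLM-generated candidates, and compare against RankEF, random selection, and log-likelihood baselines.
Analyze cost and latency by comparing the inference hardware requirements of the small models with those of large LLMs.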
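
The candidate-pool and judge-based selection steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the type aliases, `build_candidate_pool`, `select_with_judge`, `select_random`, and the toy generator and judge are all hypothetical names introduced here, and the judge is assumed to return a score where higher means "more likely correct".

```python
import random
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
# a generator returns n candidate programs for a task description,
# a judge returns a score estimating how likely a candidate is correct.
Generator = Callable[[str, int], List[str]]
Judge = Callable[[str, str], float]


def build_candidate_pool(task: str, generators: List[Generator],
                         per_generator: int = 10) -> List[str]:
    """Collect per_generator candidates from every generator.

    With five generators and ten candidates each, the pool has 50 entries.
    """
    pool: List[str] = []
    for generate in generators:
        pool.extend(generate(task, per_generator))
    return pool


def select_with_judge(task: str, pool: List[str], judge: Judge) -> str:
    """Return the candidate with the highest judge score (first one wins ties)."""
    return max(pool, key=lambda candidate: judge(task, candidate))


def select_random(pool: List[str], rng: random.Random) -> str:
    """Random baseline: pick one candidate uniformly from the pool."""
    return rng.choice(pool)


def make_toy_generator(name: str) -> Generator:
    """Stand-in for an SLM code generator; real candidates would be source code."""
    def generate(task: str, n: int) -> List[str]:
        return [f"{name}-candidate-{i}" for i in range(n)]
    return generate


def toy_judge(task: str, candidate: str) -> float:
    """Stand-in for a fine-tuned judge; it favors one specific candidate."""
    return 0.9 if candidate == "gen3-candidate-7" else 0.1


generators = [make_toy_generator(f"gen{i}") for i in range(5)]
pool = build_candidate_pool("reverse a string", generators)
best = select_with_judge("reverse a string", pool, toy_judge)
baseline = select_random(pool, random.Random(0))
```

In a real setup the toy generator and judge would be replaced by model calls, and the selected candidate would then be run against the benchmark tests to measure whether selection improved over the random baseline.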

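Experimental Results

Research Questions

RQ1: Can small language models be effectively fine-tuned to judge code correctness, and how do they compare with GPT-4.1-mini and RankEF?
RQ2: When selecting among multiple candidate solutions produced by SLM generators, can SLMs acting as judges improve code generation performance, and how do they compare with large LLMs in accuracy and cost?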

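Key Findings

SLMs do not work as zero-shot code correctness judges, but fine-tuning substantially improves their judging accuracy.
Fine-tuned SLMs reach moderate agreement with the ground truth (Cohen's Kappa between 0.45 and 0.57) and outperform RankEF in several settings.
When selecting among multiple candidates, SLMs used as judges achieve higher code generation performance than RankEF on four of the five benchmarks.
Using SLM judges to rank candidate solutions yields performance comparable to LLMs 5-25x larger, at a much lower deployment cost.
The hardware and inference cost of running a generator plus a judge (both small models) is far lower than that of running an LLM of roughly 30B parameters (roughly 1k versus 17k).
The study provides public code and data for reproducing the experiments.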
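
To make the agreement figures above easier to interpret, here is a small sketch of how F1 and Cohen's Kappa can be computed from binary judge verdicts and test-execution labels. It is an illustration under assumptions: the function names and the toy labels are invented here, F1 is computed on the "correct" class (the summary does not say which averaging the paper uses), and returning 0.0 when Kappa is undefined is a convention chosen for this sketch.

```python
from typing import List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating 1 as 'correct implementation'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 on the positive class: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(y_true: List[int], y_pred: List[int]) -> float:
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two binary raters."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n                      # observed agreement
    p_true_pos = (tp + fn) / n               # share of positives in the labels
    p_pred_pos = (tp + fp) / n               # share of positives in the verdicts
    p_e = p_true_pos * p_pred_pos + (1 - p_true_pos) * (1 - p_pred_pos)
    if p_e == 1.0:
        return 0.0                           # undefined case; convention of this sketch
    return (p_o - p_e) / (1 - p_e)


# Toy data: 1 = tests pass / judged correct, 0 = tests fail / judged incorrect.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
verdicts = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

f1 = f1_score(labels, verdicts)
kappa = cohen_kappa(labels, verdicts)
```

On this toy data the judge agrees with the tests on 8 of 10 candidates, giving F1 = 0.75 and Kappa of about 0.58. Kappa is lower than raw agreement because it discounts the agreement expected by chance (0.52 here).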

This review was generated by AI and checked by a human editor.