QUICK REVIEW

[Paper Review] Improving Code Generation via Small Language Model-as-a-judge

Giuseppe Crupi, Rosalia Tufano|arXiv (Cornell University)|Feb 12, 2026

Software Engineering Research | Citations: 0

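One-Line Summary

Summary: This paper fine-tunes small language models (SLMs) as code correctness judges and uses them to select the best solution among SLM-generated candidates, achieving competitive code generation performance at a fraction of the cost of massive LLMs.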

ABSTRACT

Large language models (LLMs) have shown remarkable capabilities in automated code generation. While effective for mainstream languages, they may underperform on less common or domain-specific languages, prompting companies to develop in-house code generators. While open-source models can be trained for this, only LLMs with tens of billions of parameters match the performance of commercial tools, demanding costly training and deployment. Recent work proposed supporting code generation with smaller models (SLMs) by generating multiple candidate solutions and using another SLM to select the most likely correct one. The most recent work in this area is the one by Sun et al. [29] presenting RankEF, a T5 model trained to rank code solutions using both execution-based and non-execution-based information. However, Sun et al. do not assess the T5 ranker's classification accuracy, that is, how often it misjudges correct implementations as incorrect or vice versa, leaving open questions about the reliability of LMs as code correctness judges for other tasks (e.g., automated code review). Moreover, their experiments involve relatively old models, making it unclear the extent to which such a methodology would still help companies in cheaply training their own code generators with performance comparable to those of massive LLMs. We present a study addressing these limitations. We train several state-of-the-art SLMs as code correctness judges and assess their ability to discriminate between correct and wrong implementations. We show that modern SLMs outperform RankEF, even without exploiting execution-based information. When used as code rankers, they achieve higher performance gains than RankEF and perform competitively with LLMs 5-25x larger, at a fraction of the cost.

Motivation and Objectives

Motivate the need for cost-effective in-house code generation tools for DSLs and less common languages.
Assess whether fine-tuned SLMs can reliably judge code correctness, independently of the code generation task.
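Assess whether SLMs-as-judges improve code generation performance by selecting the best solution from multiple candidates.
Compare the SLM-based approach with RankEF and with large LLMs in terms of performance and deployment cost.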

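Proposed Method

Fine-tune four state-of-the-art SLMs (Qwen2.5 Coder 0.5B/3B, Gemma-3 4B, Llama-3.2 3B) as code correctness judges and compare them with GPT-4.1-mini and RankEF.
Assemble the training data from 722 code generation tasks drawn from the HumanEval (Java), MBPP (Java), and CoderEval benchmarks, labeling each candidate implementation as correct or incorrect by running the tests.
Generate 10 candidate solutions per task with each of 5 code generators, yielding 50 candidates per task to be judged.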
Run the judges in four settings (zero-shot, few-shot, and fine-tuning with and without execution feedback) and evaluate F1 and Cohen's kappa against the test results.
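Use the top-performing SLM judges to select the best solution from a pool of candidates produced by SLM generators, and compare against RankEF, random, and log-likelihood-based baselines.
Analyze cost and latency by comparing the inference hardware requirements of small models with the inference cost of massive LLMs.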
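
The candidate-selection step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Judge` signature, the `select_best` and `selection_accuracy` helpers, and the `toy_judge` stand-in are all hypothetical, and averaging the scores of several judges is one plausible aggregation rule that this review does not confirm.

```python
from typing import Callable, Dict, Sequence

# Hypothetical judge interface: (task description, candidate code) -> estimated P(correct).
# In the reviewed study this role is played by a fine-tuned SLM; here it is any callable.
Judge = Callable[[str, str], float]


def select_best(task: str, candidates: Sequence[str], judges: Sequence[Judge]) -> str:
    """Return the candidate with the highest mean judge score (ties go to the earliest)."""
    if not candidates or not judges:
        raise ValueError("need at least one candidate and one judge")

    def mean_score(code: str) -> float:
        # Assumption: judges are aggregated by a simple average.
        return sum(judge(task, code) for judge in judges) / len(judges)

    return max(candidates, key=mean_score)


def selection_accuracy(
    pools: Dict[str, Sequence[str]],
    judges: Sequence[Judge],
    passes_tests: Callable[[str, str], bool],
) -> float:
    """Fraction of tasks whose selected candidate passes the task's tests."""
    hits = sum(
        passes_tests(task, select_best(task, candidates, judges))
        for task, candidates in pools.items()
    )
    return hits / len(pools)


def toy_judge(task: str, code: str) -> float:
    # Hypothetical stand-in for a fine-tuned SLM judge.
    return 0.9 if "a + b" in code else 0.1


pools = {"add two ints": ["return a - b;", "return a + b;", "return a * b;"]}


def passes(task: str, code: str) -> bool:
    # Stand-in for running the benchmark's test suite on the candidate.
    return code == "return a + b;"


print(selection_accuracy(pools, [toy_judge], passes))
```

Swapping `toy_judge` for a random scorer or for the generator's log-likelihood would give the kind of baselines the selection step is compared against.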

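Experimental Results

Research Questions

RQ1: Can small language models be effectively fine-tuned to judge code correctness, and how do they compare with GPT-4.1-mini and RankEF?
RQ2: Can SLMs-as-judges improve code generation performance by selecting among multiple candidates from SLM generators, and how does this compare with massive LLMs in accuracy and cost?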

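Key Findings

SLMs are not suited to zero-shot code correctness judgment, but fine-tuning substantially improves their judgment accuracy.
Fine-tuned SLMs reach moderate agreement with the test-based correctness labels (Cohen's kappa of 0.45 to 0.57) and outperform RankEF in some settings.
On 4 of 5 benchmarks, SLMs-as-judges yield higher code generation performance than RankEF when selecting among multiple candidates.
Ranking candidates with multiple SLM judges matches or exceeds the performance of LLMs 5-25x larger while keeping deployment cost low.
The hardware/inference cost of a small-model generator plus judge is about $1,000, far below the roughly $17,000 for an LLM of about 30B parameters.
The study provides public code and data for replication.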
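
To make the agreement figures above easier to interpret, the following sketch computes F1 and Cohen's kappa for a binary correct/incorrect judge against test-derived labels. It uses the standard definition of kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function names and the eight-sample data are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence


def cohen_kappa(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Cohen's kappa for two binary labelings of the same items."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: share of items where judge and tests agree.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from each labeling's marginal rate of "correct".
    true_pos_rate = sum(y_true) / n
    pred_pos_rate = sum(y_pred) / n
    p_e = true_pos_rate * pred_pos_rate + (1 - true_pos_rate) * (1 - pred_pos_rate)
    if p_e == 1.0:
        # Degenerate case (both labelings use a single identical class);
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 for the positive ("correct") class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative data: test outcomes vs. a hypothetical judge's verdicts.
tests = [True, True, True, True, False, False, False, False]
judge = [True, True, True, False, False, False, False, True]

print(cohen_kappa(tests, judge))  # 0.5
print(f1_score(tests, judge))     # 0.75
```

In this toy case the judge agrees with the tests on 6 of 8 items, yet kappa is only 0.5 because half of that agreement would be expected by chance with balanced classes.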

This review was generated by AI and checked by a human editor.