[Paper Review] Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement
The study shows LLMs frequently misjudge correct code against natural-language requirements, with more detailed prompts increasing misjudgments; it proposes a fix-guided verification approach to mitigate this bias.
Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation, and review tasks. Software engineers often rely on LLMs to verify whether code implementations satisfy task requirements, thereby ensuring code robustness and accuracy. However, it remains unclear whether LLMs can reliably judge code against the given task descriptions, which usually take the form of natural-language specifications. In this paper, we uncover a systematic failure of LLMs in matching code to natural-language requirements. Specifically, with widely adopted benchmarks and a unified prompt design, we demonstrate that LLMs frequently misclassify correct code implementations as non-compliant or defective. Surprisingly, we find that more detailed prompt designs, particularly those requiring explanations and proposed corrections, lead to higher misjudgment rates, highlighting critical reliability issues for LLM-based code assistants. We further analyze the mechanisms driving these failures and evaluate the reliability of rationale-required judgments. Building on these findings, we propose a Fix-guided Verification Filter that treats the model-proposed fix as executable counterfactual evidence and validates the original and revised implementations using benchmark tests and spec-constrained augmented tests. Our results expose previously under-explored limitations in LLM-based code review capabilities and provide practical guidance for integrating LLM-based reviewers with safeguards in automated review and development pipelines.
Motivation and objectives
- Assess how reliably LLMs judge code conformance to natural-language requirements without test cases.
- Evaluate how different prompting strategies affect false rejections and false acceptances.
- Characterize mechanisms behind false judgments and rationales.
- Explore mitigations to reduce judgment bias in LLM-based code review pipelines.
Proposed method
- Assemble a unified benchmark from HumanEval, MBPP, and QuixBugs with paired canonical and buggy implementations (over 1400 instances).
- Evaluate five LLMs (three closed-source, two open-source) under three prompting modes (Direct, Direct+Explain, Full).
- Use confusion-matrix metrics (FPR, FNR) to quantify false positives/negatives for each model-prompt-benchmark combination.
- Analyze rationale reliability via self-consistency and fault-awareness evaluations using external evaluators.
- Propose and assess a Fix-guided Verification Filter leveraging executable counterfactuals and spec-constrained augmented tests.
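
The Fix-guided Verification Filter in the last bullet can be illustrated with a minimal sketch. The review does not spell out the paper's exact decision rule, so the rule below is an assumption: a rejection is overturned when the original code already passes every available test, and confirmed only when the original fails while the model-proposed fix passes. The names `passes_all`, `fix_guided_filter`, and the `(args, expected)` test-case shape are hypothetical, and the test list stands in for both benchmark tests and spec-constrained augmented tests.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical test-case shape: (positional arguments, expected return value).
TestCase = Tuple[tuple, object]


def passes_all(func: Callable, tests: Iterable[TestCase]) -> bool:
    """Return True only if func returns the expected value on every test."""
    for args, expected in tests:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            # A crash is treated as a failed test.
            return False
    return True


def fix_guided_filter(
    original: Callable, proposed_fix: Callable, tests: Iterable[TestCase]
) -> str:
    """Re-check an LLM 'reject' verdict using its own fix as evidence.

    Assumed decision rule (not taken from the paper):
      - original passes everything  -> the rejection was likely spurious
      - original fails, fix passes  -> the rejection is supported
      - otherwise                   -> no conclusion from execution alone
    """
    tests = list(tests)  # allow the same tests to be run twice
    if passes_all(original, tests):
        return "overturn_rejection"
    if passes_all(proposed_fix, tests):
        return "confirm_rejection"
    return "inconclusive"


# Usage: a correct implementation that a reviewer model wrongly rejected.
def add(a, b):
    return a + b


def add_rewritten(a, b):  # the model's unnecessary "fix"
    return sum([a, b])


tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(fix_guided_filter(add, add_rewritten, tests))  # overturn_rejection
```

Running both versions against the same tests is what turns the proposed fix into counterfactual evidence: if the fix changes no observable behavior, the original verdict has no executable support.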

Experimental results
Research questions
- RQ1: How reliably can LLMs assess code conformance to specifications without test cases?
- RQ2: How does prompt design affect conformance judgments and tradeoffs between false negatives and false positives?
- RQ3: What mechanisms drive false acceptance and false rejection, including bug-type emphasis and rationale patterns?
- RQ4: How reliable are explanations produced under rationale-required prompts, and do they align with judgments?
- RQ5: Can mitigation strategies effectively reduce judgment biases in LLM-based reviews?
Key findings
| Model | Prompt | HumanEval FPR (%) | HumanEval FNR (%) | MBPP FPR (%) | MBPP FNR (%) | QuixBugs FPR (%) | QuixBugs FNR (%) |
|---|---|---|---|---|---|---|---|
| GPT-4o | Direct | 2.44 | 26.2 | 3.70 | 35.9 | 10.9 | 35.0 |
| GPT-4o | Direct+Explain | 0.00 | 58.5 | 0.00 | 74.1 | 5.00 | 45.0 |
| GPT-4o | Full | 0.00 | 73.2 | 0.20 | 87.9 | 5.00 | 60.0 |
| Gemini-2.0-flash | Direct | 8.54 | 25.6 | 10.3 | 34.7 | 22.5 | 25.0 |
| Gemini-2.0-flash | Direct+Explain | 7.32 | 23.2 | 11.1 | 35.1 | 22.5 | 22.5 |
| Gemini-2.0-flash | Full | 5.49 | 34.1 | 7.69 | 39.6 | 17.5 | 32.5 |
| Claude-4-5-sonnet | Direct | 2.44 | 26.2 | 6.57 | 58.5 | 5.00 | 40.0 |
| Claude-4-5-sonnet | Direct+Explain | 1.21 | 34.1 | 6.94 | 55.7 | 2.50 | 40.0 |
| Claude-4-5-sonnet | Full | 0.61 | 36.0 | 5.44 | 62.3 | 2.50 | 50.0 |
| Llama-3.1-8B | Direct | 17.1 | 57.3 | 3.56 | 74.7 | 27.5 | 52.5 |
| Llama-3.1-8B | Direct+Explain | 6.71 | 86.6 | 0.38 | 91.9 | 5.00 | 87.5 |
| Llama-3.1-8B | Full | 6.10 | 84.1 | 1.88 | 88.2 | 30.0 | 77.5 |
| Mistral-Small-3.1-24B | Direct | 6.71 | 35.9 | 5.25 | 60.9 | 40.0 | 40.0 |
| Mistral-Small-3.1-24B | Direct+Explain | 14.6 | 31.1 | 7.13 | 47.8 | 40.0 | 32.5 |
| Mistral-Small-3.1-24B | Full | 4.88 | 48.8 | 4.31 | 74.3 | 27.5 | 62.5 |
- LLMs show substantial false negatives when judging correct implementations (FNR ranges from 22.5% to 91.9% across the model-prompt-benchmark combinations above), indicating over-correction bias.
- More detailed prompts often shift errors from false positives (buggy code accepted) to false negatives (correct code rejected), revealing a tradeoff rather than a universal improvement.
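- GPT-4o shows the sharpest rise in over-correction with more detailed prompts (HumanEval FNR climbs from 26.2% under Direct to 73.2% under Full), while some models accept buggy code at high rates (for example, Mistral-Small-3.1-24B reaches 40.0% FPR on QuixBugs under Direct).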
- Rationale outputs can be inconsistent with verdicts (self-consistency issues) and may not reliably reflect fault-aware reasoning.
- Open-source models generally exhibit higher error rates and stronger sensitivity to prompts than some closed-source models.
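
For reference, the FPR and FNR columns in the table can be computed as in the sketch below. It assumes the convention implied by the findings above: "positive" means the model judged the code as conforming, so a false positive is buggy code that was accepted and a false negative is correct code that was rejected. The function name `fpr_fnr` and the record layout are hypothetical.

```python
import math
from typing import Iterable, Tuple


def fpr_fnr(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute (FPR %, FNR %) from (is_actually_correct, judged_correct) pairs.

    FPR = accepted buggy implementations / all buggy implementations
    FNR = rejected correct implementations / all correct implementations
    A rate is NaN when its denominator is zero.
    """
    false_pos = false_neg = n_buggy = n_correct = 0
    for is_correct, judged_correct in records:
        if is_correct:
            n_correct += 1
            if not judged_correct:
                false_neg += 1  # correct code rejected (over-correction)
        else:
            n_buggy += 1
            if judged_correct:
                false_pos += 1  # buggy code accepted (unsafe acceptance)
    fpr = 100.0 * false_pos / n_buggy if n_buggy else math.nan
    fnr = 100.0 * false_neg / n_correct if n_correct else math.nan
    return fpr, fnr


# Usage: 4 correct implementations (1 rejected), 4 buggy ones (2 accepted).
verdicts = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True), (False, True),
]
print(fpr_fnr(verdicts))  # (50.0, 25.0)
```

Because the benchmark pairs each canonical implementation with a buggy one, the two rates are computed over separate halves of the data, which is why one can fall while the other rises.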

This review was generated by AI and checked by a human editor.