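[Paper Digest] Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
PCFJudge reruns the same factuality-first listwise evaluation prompt over multiple permutations of the candidate set and aggregates the results into a robust consensus score, reducing order-induced instability.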
Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing sharply in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, PCFJudge improves over direct judging by up to 7 absolute points. Development ablations show that the dominant gain comes from permutation consensus itself rather than from heavier arbitration layers. These results suggest that a meaningful share of factuality-judging error arises from order instability, and that averaging over this nuisance variation is a simple and effective way to make LLM evaluation more reliable.
Motivation and Objectives
- Motivate and address the instability of LLM judges due to candidate-order sensitivity in listwise factuality evaluation.
- Introduce a training-free, inference-time method (PCFJudge) to achieve order-robust consensus without retraining or external verifiers.
- Formalize permutation-consensus as an order-robust estimator and analyze its error-reduction properties under a weak-independence assumption.
- Demonstrate gains on RewardBench 2 Factuality with two backbones and assess transfer to JudgeBench via development ablations.
Proposed Method
- Define a factuality-first listwise prompt and its outputs (score, rationale, binary flags).
- Run the same prompt over K permutations of the candidate list and map outputs back to original candidates.
- Aggregate per-candidate statistics across permutations: mean score, Borda-style rank contribution, top-set indicator, and calibrated uncertainty.
- Compute a final consensus score C_i as a weighted combination of these statistics: C_i = 0.50 s̄_i + 0.25 B_i + 0.20(100 v_i) + 0.05(100 u_i).
- Use K=7 in final RewardBench 2 experiments to derive the consensus and select the winner.
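The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the weights come from the formula for C_i, but the digest does not define B_i, v_i, or u_i precisely, so the sketch assumes B_i is a Borda count rescaled to 0-100, v_i is the fraction of orderings in which the candidate holds the top score, and u_i is a stability term derived from score spread. The names `Judge`, `sample_orders`, `permutation_consensus`, and the toy `biased_judge` are hypothetical.

```python
import random
import statistics
from typing import Callable, List, Sequence

# Hypothetical judge interface: returns one 0-100 score per presented slot.
Judge = Callable[[Sequence[str]], List[float]]


def sample_orders(n: int, k: int = 7, seed: int = 0) -> List[List[int]]:
    """Draw k random orderings of candidate indices (duplicates are possible)."""
    rng = random.Random(seed)
    return [rng.sample(range(n), n) for _ in range(k)]


def permutation_consensus(
    candidates: Sequence[str], judge: Judge, orders: Sequence[Sequence[int]]
) -> List[float]:
    """Judge each ordering, map scores back to original indices, and combine."""
    n = len(candidates)
    scores: List[List[float]] = [[] for _ in range(n)]
    borda: List[List[float]] = [[] for _ in range(n)]
    top: List[List[float]] = [[] for _ in range(n)]
    for order in orders:
        shown = judge([candidates[i] for i in order])
        best = max(shown)
        for pos, orig in enumerate(order):
            s = shown[pos]
            beaten = sum(1 for t in shown if t < s)
            scores[orig].append(s)
            # Assumption: Borda contribution rescaled to 0-100.
            borda[orig].append(100.0 * beaten / max(n - 1, 1))
            # Assumption: top-set membership means holding the maximum score.
            top[orig].append(1.0 if s == best else 0.0)
    consensus = []
    for i in range(n):
        s_bar = statistics.fmean(scores[i])
        b = statistics.fmean(borda[i])
        v = statistics.fmean(top[i])
        # Assumption: u_i rewards stability (low score spread across orderings).
        u = 1.0 - min(1.0, statistics.pstdev(scores[i]) / 50.0)
        # Weights taken from the digest's formula for C_i.
        consensus.append(0.50 * s_bar + 0.25 * b + 0.20 * (100 * v) + 0.05 * (100 * u))
    return consensus


# Toy judge (hypothetical): fixed quality plus a bonus for the first slot.
QUALITY = {"A": 80.0, "B": 78.0, "C": 40.0, "D": 30.0}


def biased_judge(shown: Sequence[str]) -> List[float]:
    return [QUALITY[c] + (5.0 if pos == 0 else 0.0) for pos, c in enumerate(shown)]


cands = ["A", "B", "C", "D"]
# Four cyclic rotations keep the demo deterministic.
rotations = [[(start + j) % 4 for j in range(4)] for start in range(4)]
single = biased_judge(["B", "C", "D", "A"])  # one pass with B shown first
consensus = permutation_consensus(cands, biased_judge, rotations)
winner = cands[max(range(len(cands)), key=consensus.__getitem__)]
```

In this toy setup a single pass with B in the first slot scores B above A, while the consensus over rotations selects A because the position bonus is spread evenly across candidates. To mirror the K=7 setting, pass `sample_orders(len(cands), k=7)` as `orders` and replace `biased_judge` with a call to a real LLM judge.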
Experimental Results
Research Questions
- RQ1: Can candidate-order variation in listwise factuality evaluation be effectively mitigated without retraining or extra verification steps?
- RQ2: Does averaging across multiple permutations produce a more reliable judge than a single-pass evaluation in real-world datasets?
- RQ3: How does permutation-consensus affect performance in factuality-focused listwise settings compared to pairwise transfer setups?
- RQ4: Under what conditions does order-robust judging provide the strongest gains and how transferable is the approach across backbones?
Key Findings
- On RewardBench 2 Factuality, PCFJudge improved GPT-5.4 by +5.17 absolute points and Claude Sonnet 4.6 by +7.00 points over direct judging on 300-example slices.
- Across both backbones (600 total examples), PCFJudge achieved a weighted average gain of +6.08 points.
- Among discordant examples, PCFJudge improved over direct judging in 69 cases and regressed in 29 (p<10^-4), indicating a robust positive effect.
- JudgeBench transfer results showed positive but smaller gains: +3.24 (Claude Sonnet 4.6) and +2.70 (GPT-5.4) on 100-pair slices.
- Development ablations indicate most gains come from permutation consensus itself rather than heavier arbitration layers.
- Qualitative patterns show improved reliability against unsupported specificity and over-confident, order-sensitive outputs.
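The 69/29 discordant split can be checked with a short calculation. The digest does not say which test produced the reported p-value, so the sketch below assumes an exact two-sided sign test (equivalent to an exact McNemar test on the discordant pairs); `sign_test_two_sided` is a hypothetical helper name.

```python
from math import comb


def sign_test_two_sided(wins: int, losses: int) -> float:
    """Exact two-sided sign test p-value under a fair-coin null hypothesis."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_two_sided(69, 29)
```

Under this assumption the p-value falls below 10^-4, consistent with the reported bound.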
This digest was generated by AI and reviewed by a human editor.