QUICK REVIEW

[Paper Review] Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

Tianyi Huang, Nathan Huang | arXiv (Cornell University) | Mar 20, 2026


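One-Sentence Summary

PCFJudge reruns the same factuality-first listwise evaluation prompt over multiple permutations of the candidate set and aggregates the results into a robust consensus score, reducing order-induced instability.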

ABSTRACT

Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing sharply in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, PCFJudge improves over direct judging by up to 7 absolute points. Development ablations show that the dominant gain comes from permutation consensus itself rather than from heavier arbitration layers. These results suggest that a meaningful share of factuality-judging error arises from order instability, and that averaging over this nuisance variation is a simple and effective way to make LLM evaluation more reliable.

Research Motivation and Goals

Motivate and address the instability of LLM judges due to candidate-order sensitivity in listwise factuality evaluation.
Introduce a training-free, inference-time method (PCFJudge) to achieve order-robust consensus without retraining or external verifiers.
Formalize permutation-consensus as an order-robust estimator and analyze its error-reduction properties under a weak-independence assumption.
Demonstrate gains on RewardBench 2 Factuality with two backbones and assess transfer to JudgeBench via development ablations.
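
One standard way to see why averaging over permutations can reduce error is the variance of a mean of correlated estimates. This is an illustration only, not necessarily the paper's own formalization. It assumes that each per-permutation score for candidate i has the same variance σ² and that any two of them share a pairwise correlation ρ; the symbols σ, ρ, and π_k are introduced here and do not appear in the summary.

```latex
\hat{s}_i = \frac{1}{K} \sum_{k=1}^{K} s_i^{(\pi_k)},
\qquad
\operatorname{Var}\!\left(\hat{s}_i\right)
  = \frac{\sigma^2}{K} + \frac{K-1}{K}\,\rho\,\sigma^2
```

Under these assumptions, the order-induced variance shrinks as 1/K when the permutation runs are close to independent (ρ near 0), and it cannot fall below ρσ² when they are positively correlated.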

Proposed Method

Define a factuality-first listwise prompt and its outputs (score, rationale, binary flags).
Run the same prompt over K permutations of the candidate list and map outputs back to original candidates.
Aggregate per-candidate statistics across permutations: mean score, Borda-style rank contribution, top-set indicator, and calibrated uncertainty.
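Compute a final consensus score C_i as a weighted combination of these statistics: C_i = 0.50 s̄_i + 0.25 B_i + 0.20(100 v_i) + 0.05(100 u_i), where the four terms appear to correspond, in order, to the mean score, the Borda-style rank contribution, the top-set indicator, and the uncertainty term.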
Use K=7 in final RewardBench 2 experiments to derive the consensus and select the winner.
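
The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The summary does not define B_i, v_i, or u_i precisely, so the sketch assumes that B_i is a Borda score rescaled to 0-100, that v_i is the fraction of runs in which the candidate ranks first, and that u_i is a stability term (1 minus the normalized score spread). The `judge` callable, `biased_judge`, and the `BASE` scores are hypothetical stand-ins for a real LLM call.

```python
import random
from statistics import mean, pstdev


def permutation_consensus(candidates, judge, k=7, seed=0, orders=None):
    """Aggregate a listwise judge over several orderings of the same candidates.

    `judge` is a hypothetical callable: it receives the candidates in presented
    order and returns one 0-100 factuality score per candidate, in that order.
    """
    n = len(candidates)
    if orders is None:
        rng = random.Random(seed)
        orders = [rng.sample(range(n), n) for _ in range(k)]  # k random orderings
    scores = [[] for _ in range(n)]
    borda = [[] for _ in range(n)]
    top = [[] for _ in range(n)]
    for order in orders:
        listed = judge([candidates[i] for i in order])
        # Map positional outputs back to the original candidate indices.
        run = {i: listed[pos] for pos, i in enumerate(order)}
        # Rank within this run; ties are broken by original index.
        ranked = sorted(range(n), key=lambda i: run[i], reverse=True)
        for rank, i in enumerate(ranked):
            scores[i].append(run[i])
            borda[i].append(100.0 * (n - 1 - rank) / max(n - 1, 1))  # assumed 0-100 scaling
            top[i].append(1.0 if rank == 0 else 0.0)  # assumed: top set = rank 1
    consensus = []
    for i in range(n):
        s_bar, b_i, v_i = mean(scores[i]), mean(borda[i]), mean(top[i])
        # Assumed stability term: 50 is the largest possible spread of 0-100 scores.
        u_i = max(0.0, 1.0 - pstdev(scores[i]) / 50.0)
        consensus.append(0.50 * s_bar + 0.25 * b_i + 0.20 * (100 * v_i) + 0.05 * (100 * u_i))
    return consensus


BASE = {"A": 70.0, "B": 75.0, "C": 40.0}  # made-up "true" factuality scores


def biased_judge(listed):
    """Toy judge with a primacy bias: the first-listed answer gets +15 points."""
    return [BASE[c] + (15.0 if pos == 0 else 0.0) for pos, c in enumerate(listed)]
```

With this toy judge, a single pass over the order A, B, C prefers A because of the primacy bonus, while the consensus over all six orderings of the three candidates prefers B, the candidate with the highest underlying score.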

Experimental Results

Research Questions

RQ1: Can candidate-order variation in listwise factuality evaluation be effectively mitigated without retraining or extra verification steps?
RQ2: Does averaging across multiple permutations produce a more reliable judge than a single-pass evaluation in real-world datasets?
RQ3: How does permutation-consensus affect performance in factuality-focused listwise settings compared to pairwise transfer setups?
RQ4: Under what conditions does order-robust judging provide the strongest gains and how transferable is the approach across backbones?

Key Findings

On RewardBench 2 Factuality, PCFJudge improved GPT-5.4 by +5.17 absolute points and Claude Sonnet 4.6 by +7.00 points over direct judging on 300-example slices.
Across both backbones (600 total examples), PCFJudge achieved a weighted average gain of +6.08 points.
Among the discordant cases, where PCFJudge and direct judging disagreed, PCFJudge improved the outcome in 69 and regressed in 29 (p<10^-4), indicating a robust positive effect.
JudgeBench transfer results showed positive but smaller gains: +3.24 (Claude Sonnet 4.6) and +2.70 (GPT-5.4) on 100-pair slices.
Development ablations indicate most gains come from permutation consensus itself rather than heavier arbitration layers.
Qualitative patterns show improved reliability against unsupported specificity and over-confident, order-sensitive outputs.
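
The arithmetic behind the pooled gain and the discordant-case significance can be checked with a short script. The summary does not say which test produced the p-value, so the exact two-sided sign test used here (equivalent to an exact McNemar test on the discordant cases) is an assumption; the function names are illustrative.

```python
from math import comb


def pooled_gain(gains, sizes):
    """Size-weighted average of per-backbone gains (in absolute points)."""
    return sum(g * m for g, m in zip(gains, sizes)) / sum(sizes)


def sign_test_two_sided(wins, losses):
    """Exact two-sided sign test: are wins and losses equally likely?"""
    n = wins + losses
    k = max(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Reported figures: +5.17 and +7.00 on 300 examples each; 69 improvements vs 29 regressions.
print(pooled_gain([5.17, 7.00], [300, 300]))
print(sign_test_two_sided(69, 29))
```

Because both slices have 300 examples, the weighted average reduces to the simple mean of the two gains, which agrees with the reported +6.08 up to rounding of the per-backbone figures.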


This review was generated by AI and reviewed by a human editor.