QUICK REVIEW

[Paper Review] Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

Tianyi Huang, Nathan Huang|arXiv (Cornell University)|Mar 20, 2026
Topic Modeling | Cited by 0
One-Sentence Summary

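PCFJudge reruns the same factuality-first listwise evaluation prompt over multiple permutations of the candidate set and aggregates the results into a robust consensus score, reducing order-induced instability.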

ABSTRACT

Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing sharply in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, PCFJudge improves over direct judging by up to 7 absolute points. Development ablations show that the dominant gain comes from permutation consensus itself rather than from heavier arbitration layers. These results suggest that a meaningful share of factuality-judging error arises from order instability, and that averaging over this nuisance variation is a simple and effective way to make LLM evaluation more reliable.

Research Motivation and Objectives

  • Motivate and address the instability of LLM judges due to candidate-order sensitivity in listwise factuality evaluation.
  • Introduce a training-free, inference-time method (PCFJudge) to achieve order-robust consensus without retraining or external verifiers.
  • Formalize permutation-consensus as an order-robust estimator and analyze its error-reduction properties under a weak-independence assumption.
  • Demonstrate gains on RewardBench 2 Factuality with two backbones and assess transfer to JudgeBench via development ablations.
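
As a rough illustration of why averaging over orderings can reduce error, here is a standard variance identity for the mean of K correlated estimates. This is a sketch under assumed notation, not necessarily the paper's own formalization: the per-permutation score s_{i,k}, its order-free mean \mu_i, the order-induced noise \varepsilon_{i,k}, its variance \sigma^2, and the pairwise correlation \rho are all symbols introduced here for illustration.

```latex
% Assumed model: the score of candidate i under permutation k is
%   s_{i,k} = \mu_i + \varepsilon_{i,k},
% with E[\varepsilon_{i,k}] = 0, Var(\varepsilon_{i,k}) = \sigma^2,
% and Corr(\varepsilon_{i,k}, \varepsilon_{i,k'}) = \rho for k \neq k'.
\[
  \bar{s}_i = \frac{1}{K} \sum_{k=1}^{K} s_{i,k},
  \qquad
  \operatorname{Var}(\bar{s}_i)
    = \frac{\sigma^2}{K}\,\bigl(1 + (K-1)\,\rho\bigr).
\]
% With \rho = 0 and K = 7 the order-induced variance drops to \sigma^2 / 7;
% as \rho \to 1 the benefit of averaging vanishes.
```

Under this reading, a weak-independence assumption corresponds to a small \rho, which is the regime where averaging over permutations helps most.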

Proposed Method

  • Define a factuality-first listwise prompt and its outputs (score, rationale, binary flags).
  • Run the same prompt over K permutations of the candidate list and map outputs back to original candidates.
  • Aggregate per-candidate statistics across permutations: mean score, Borda-style rank contribution, top-set indicator, and calibrated uncertainty.
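  • Compute a final consensus score C_i as a weighted combination of these statistics: C_i = 0.50 s̄_i + 0.25 B_i + 0.20(100 v_i) + 0.05(100 u_i), where, for candidate i, s̄_i is the mean score, B_i the Borda-style rank contribution, v_i the top-set indicator, and u_i the calibrated uncertainty term. The four weights sum to 1.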
  • Use K=7 in final RewardBench 2 experiments to derive the consensus and select the winner.
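
The steps above can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the authors' implementation: `judge` is a hypothetical callable standing in for the LLM prompt and is assumed to return one 0-100 score per listed candidate; the Borda contribution is assumed to be normalized to 0-100; the top-set indicator is assumed to mean "tied for the highest score in that run"; and `u` is a stand-in stability term (one minus the normalized score spread), because this summary does not define how the calibrated uncertainty is computed. Only the four weights and K=7 come from the text.

```python
import random
from statistics import mean, pstdev


def permutation_consensus(candidates, judge, k=7, seed=0):
    """Return one consensus score C_i per candidate, in original order.

    `judge` is a hypothetical callable: it receives the candidates in some
    order and returns one 0-100 factuality score per position.
    """
    rng = random.Random(seed)
    n = len(candidates)
    scores = [[] for _ in range(n)]  # per-candidate scores across runs
    borda = [[] for _ in range(n)]   # per-candidate Borda points (assumed 0-100)
    top = [[] for _ in range(n)]     # 1.0 if the candidate tied for best in a run

    for _ in range(k):
        order = list(range(n))
        rng.shuffle(order)
        out = judge([candidates[i] for i in order])
        # Map positional outputs back to the original candidate indices.
        by_cand = {orig: out[pos] for pos, orig in enumerate(order)}
        best = max(out)
        ranked = sorted(range(n), key=lambda i: by_cand[i], reverse=True)
        for rank, i in enumerate(ranked):
            scores[i].append(by_cand[i])
            borda[i].append(100.0 * (n - 1 - rank) / max(n - 1, 1))
            top[i].append(1.0 if by_cand[i] == best else 0.0)

    consensus = []
    for i in range(n):
        s_bar = mean(scores[i])
        b = mean(borda[i])
        v = mean(top[i])
        # ASSUMPTION: stability stand-in for the calibrated uncertainty term.
        u = max(0.0, 1.0 - pstdev(scores[i]) / 50.0)
        consensus.append(0.50 * s_bar + 0.25 * b + 0.20 * (100 * v) + 0.05 * (100 * u))
    return consensus


# Toy data: hypothetical "true" factuality scores for three answers.
truth = {"A": 90, "B": 60, "C": 30}


def biased_judge(ordered):
    # Toy judge with position bias: the first-listed answer gets +10.
    return [truth[c] + (10 if pos == 0 else 0) for pos, c in enumerate(ordered)]


result = permutation_consensus(list(truth), biased_judge)
winner = list(truth)[result.index(max(result))]
print(winner, [round(c, 2) for c in result])
```

The toy judge inflates whichever answer is listed first, so a single pass depends on the ordering, while the consensus averages that bonus across the K shuffles.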

Experimental Results

Research Questions

  • RQ1: Can candidate-order variation in listwise factuality evaluation be effectively mitigated without retraining or extra verification steps?
  • RQ2: Does averaging across multiple permutations produce a more reliable judge than a single-pass evaluation in real-world datasets?
  • RQ3: How does permutation-consensus affect performance in factuality-focused listwise settings compared to pairwise transfer setups?
  • RQ4: Under what conditions does order-robust judging provide the strongest gains, and how transferable is the approach across backbones?

Key Findings

  • On RewardBench 2 Factuality, PCFJudge improved GPT-5.4 by +5.17 absolute points and Claude Sonnet 4.6 by +7.00 points over direct judging on 300-example slices.
  • Across both backbones (600 total examples), PCFJudge achieved a weighted average gain of +6.08 points.
  • Among discordant cases, where PCFJudge and direct judging disagree on correctness, PCFJudge improved on direct judging in 69 cases and regressed in 29 (p<10^-4), indicating a robust positive effect.
  • JudgeBench transfer results showed positive but smaller gains: +3.24 (Claude Sonnet 4.6) and +2.70 (GPT-5.4) on 100-pair slices.
  • Development ablations indicate most gains come from permutation consensus itself rather than heavier arbitration layers.
  • Qualitative patterns show improved reliability against unsupported specificity and over-confident, order-sensitive outputs.
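
The reported 69/29 discordant split can be sanity-checked with a short calculation. The summary does not say which significance test the authors used; the sketch below assumes an exact two-sided sign test (equivalent to an exact McNemar test on the discordant pairs), and `sign_test_p` is a helper name introduced here for illustration.

```python
from math import comb


def sign_test_p(improved, regressed):
    """Exact two-sided sign test on discordant pairs.

    Under the null hypothesis, each discordant case is equally likely to be
    an improvement or a regression (Binomial(n, 0.5)).
    """
    n = improved + regressed
    hi = max(improved, regressed)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, j) for j in range(hi, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_p(69, 29)
print(f"p = {p_value:.2e}")
```

Under that assumed test, the p-value falls below 10^-4, consistent with the figure quoted above.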


This review was generated by AI and reviewed by a human editor.