QUICK REVIEW

[Paper Review] Decomposing Physician Disagreement in HealthBench

Satya Borgohain, Roy Mariathas|arXiv (Cornell University)|Feb 26, 2026
Artificial Intelligence in Healthcare and Education | Cited by 0
One-Sentence Summary

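This paper decomposes the variance in physician disagreement in HealthBench and finds that rubric identity explains 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance, with most of the variance sitting in a large case-level residual (81.8%); reducible uncertainty more than doubles the odds of disagreement (OR = 2.55), which suggests that evaluation design can be improved.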

ABSTRACT

We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it. Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance; physician identity accounts for just 2.4%. The dominant 81.8% case-level residual is not reduced by HealthBench's metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0/300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485). Disagreement follows an inverted-U with completion quality (AUC = 0.689), confirming physicians agree on clearly good or bad outputs but split on borderline cases. Physician-validated uncertainty categories reveal that reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR = 2.55, p < 10^(-24)), while irreducible uncertainty (genuine medical ambiguity) has no effect (OR = 1.01, p = 0.90), though even the former explains only ~3% of total variance. The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity does not, pointing toward actionable evaluation design improvements.

Motivation and Objectives

  • Understand where variance in physician disagreement arises in HealthBench.
  • Quantify contributions of rubric identity, physician identity, and metadata to disagreement variance.
  • Assess whether observable features explain disagreement and identify actionable evaluation-design improvements.

Proposed Method

  • Decompose variance in HealthBench disagreement across rubric identity, physician identity, and case-level factors.
  • Test whether observable features explain the case-level residual: metadata labels, normative rubric language (pseudo R^2), medical specialty (Tukey pairwise comparisons), surface-feature triage (AUC), and embeddings (AUC).
  • Model physician-validated uncertainty categories to separate reducible vs irreducible uncertainty and their effects on disagreement odds.
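
As a rough illustration of what "decomposing variance" across rubric and physician identity means, the sketch below computes the share of label variance explained by group means (eta-squared) on a tiny, made-up, balanced dataset. This is an assumption for illustration only: the summary does not state which estimator the paper uses, and the function name `variance_share` and all data values are hypothetical.

```python
from collections import defaultdict


def variance_share(labels, groups):
    """Share of total variance in `labels` explained by group means (eta-squared).

    Illustrative only; not necessarily the estimator used in the paper.
    """
    n = len(labels)
    grand_mean = sum(labels) / n
    total_ss = sum((y - grand_mean) ** 2 for y in labels)
    if total_ss == 0:
        return 0.0  # no variance to explain

    by_group = defaultdict(list)
    for y, g in zip(labels, groups):
        by_group[g].append(y)

    # Between-group sum of squares: group size times squared mean offset.
    between_ss = sum(
        len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
        for ys in by_group.values()
    )
    return between_ss / total_ss


# Hypothetical toy data: 4 rubrics, each rated once by 2 physicians (1 = met).
labels = [1, 1, 0, 0, 1, 0, 1, 0]
rubric_ids = ["r1", "r1", "r2", "r2", "r3", "r3", "r4", "r4"]
physician_ids = ["p1", "p2", "p1", "p2", "p1", "p2", "p1", "p2"]

rubric_share = variance_share(labels, rubric_ids)
physician_share = variance_share(labels, physician_ids)
# In a balanced crossed design the two shares do not overlap,
# so what is left over is the residual share.
residual_share = 1.0 - rubric_share - physician_share

print(f"rubric: {rubric_share:.2f}, physician: {physician_share:.2f}, "
      f"residual: {residual_share:.2f}")
```

Subtracting the two shares to get the residual only works because the toy design is balanced and fully crossed. On real, unbalanced rating data a mixed-effects model or a similar estimator would be the more appropriate choice.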

Experimental Results

Research Questions

  • RQ1: What portion of disagreement variance is explained by rubric identity, physician identity, and metadata?
  • RQ2: Do observable features (metadata, language, specialty, surface features, embeddings) predict disagreement?
  • RQ3: Is disagreement higher for borderline cases, and how does completion quality affect agreement?
  • RQ4: How do reducible and irreducible uncertainties impact disagreement odds?
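
RQ2 is a question about predictive power, and the abstract reports it as AUC values. For readers less familiar with the metric, here is a minimal sketch of AUC as a pairwise ranking probability, where 0.5 corresponds to chance. The function name `auc` and the toy scores are hypothetical, and this is not the paper's evaluation code.

```python
def auc(scores, labels):
    """Probability that a random positive case outscores a random negative one.

    Ties count as half a win. `labels` are 1 (disagreement) or 0 (agreement).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy cases.
perfect = auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])        # positives always rank higher
uninformative = auc([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1])  # constant score, all ties

print(perfect, uninformative)
```

Read this way, the reported AUCs of 0.58 for surface-feature triage and 0.485 for embeddings sit close to the 0.5 chance level.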

Key Findings

  • Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance.
  • Physician identity accounts for just 2.4% of variance.
  • A large 81.8% case-level residual remains unexplained by analyzed features.
  • The residual is not reduced by HealthBench's metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0/300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485).
  • Disagreement follows an inverted-U with completion quality (AUC = 0.689).
  • Reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR = 2.55, p < 10^(-24)); irreducible uncertainty has no effect (OR = 1.01, p = 0.90).
  • Even reducible uncertainty explains only ~3% of total variance.
  • Conclusion: the agreement ceiling is largely structural, but closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity does not.
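
The reported effect of reducible uncertainty is an odds ratio, which is not the same as a ratio of probabilities. The sketch below converts an odds ratio into a disagreement probability for a given baseline. The 20% baseline is a hypothetical value chosen for illustration (the summary does not report a baseline disagreement rate), and `apply_odds_ratio` is a made-up helper name.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Return the probability implied by scaling the baseline odds by `odds_ratio`."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


baseline = 0.20  # hypothetical baseline disagreement rate, not from the paper

reducible = apply_odds_ratio(baseline, 2.55)    # OR reported for reducible uncertainty
irreducible = apply_odds_ratio(baseline, 1.01)  # OR reported for irreducible uncertainty

print(f"reducible: {reducible:.3f}, irreducible: {irreducible:.3f}")
```

Under this assumed baseline, an odds ratio of 2.55 raises the disagreement probability from 0.20 to about 0.39, so the odds more than double while the probability does not quite double. An odds ratio of 1.01 leaves the probability essentially unchanged.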


This review was generated by AI and reviewed by a human editor.