QUICK REVIEW

[Paper Review] Decomposing Physician Disagreement in HealthBench

Satya Borgohain, Roy Mariathas | arXiv (Cornell University) | Feb 26, 2026

Artificial Intelligence in Healthcare and Education | Cited by 0

One-Sentence Summary

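This paper decomposes the variance of physician disagreement in HealthBench. Rubric identity explains part of the met/not-met label variance but very little of the disagreement variance, and most disagreement is driven by a large case-level residual. Reducible uncertainty more than doubles the odds of disagreement while irreducible uncertainty has no effect, which points to improvements in evaluation design.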

ABSTRACT

We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it. Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance; physician identity accounts for just 2.4%. The dominant 81.8% case-level residual is not reduced by HealthBench's metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0/300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485). Disagreement follows an inverted-U with completion quality (AUC = 0.689), confirming physicians agree on clearly good or bad outputs but split on borderline cases. Physician-validated uncertainty categories reveal that reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR = 2.55, p < 10^(-24)), while irreducible uncertainty (genuine medical ambiguity) has no effect (OR = 1.01, p = 0.90), though even the former explains only ~3% of total variance. The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity does not, pointing toward actionable evaluation design improvements.

Motivation and Objectives

Understand where variance in physician disagreement arises in HealthBench.
Quantify contributions of rubric identity, physician identity, and metadata to disagreement variance.
Assess whether observable features explain disagreement and identify actionable evaluation-design improvements.

Proposed Method

Decompose variance in HealthBench disagreement across rubric identity, physician identity, and case-level factors.
Evaluate whether metadata labels, normative rubric language (pseudo R^2), medical specialty (Tukey pairwise comparisons), surface features (AUC), and embeddings (AUC) explain or predict disagreement.
Model physician-validated uncertainty categories to separate reducible vs irreducible uncertainty and their effects on disagreement odds.
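To make "share of variance explained by a factor" concrete, here is a minimal sketch using the law of total variance: the between-group variance of a grouping factor divided by the total variance. This is an illustration only. The summary does not state which estimator the authors used, and the function name `variance_share` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def variance_share(values, groups):
    """Fraction of total variance explained by group means (eta squared)."""
    total = pvariance(values)
    if total == 0:
        return 0.0  # no variance to explain
    grand_mean = fmean(values)
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    # Between-group variance: size-weighted squared deviation of group means.
    between = sum(
        len(vals) * (fmean(vals) - grand_mean) ** 2 for vals in by_group.values()
    ) / len(values)
    return between / total


# Hypothetical toy data: 1.0 = physicians disagreed on this rubric item.
disagree = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
rubric = ["a", "a", "a", "a", "b", "b", "b", "b"]

share = variance_share(disagree, rubric)
print(f"share of disagreement variance explained by rubric: {share:.2f}")
```

Whatever is not captured by the grouping factor stays in the residual, which is the sense in which the case-level residual dominates in the findings.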

Experimental Results

Research Questions

RQ1: What portion of disagreement variance is explained by rubric identity, physician identity, and metadata?
RQ2: Do observable features (metadata, language, specialty, surface features, embeddings) predict disagreement?
RQ3: Is disagreement higher for borderline cases and how does completion quality affect agreement?
RQ4: How do reducible and irreducible uncertainties impact disagreement odds?
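Predictive power for RQ2 and RQ3 is reported as AUC. As a reading aid, the sketch below computes AUC as the probability that a randomly chosen disagreement case receives a higher score than a randomly chosen agreement case, so 0.5 corresponds to chance. The function name `auc` and the inputs are hypothetical, and this is not the authors' evaluation code.

```python
def auc(scores, labels):
    """Rank-based AUC: P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: a perfect ranker and an uninformative one.
perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
uninformative = auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
print(perfect, uninformative)
```

On this scale, a value near 0.5 means the feature set ranks disagreement cases no better than a coin flip.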

Key Findings

Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance.
Physician identity accounts for 2.4% of disagreement variance.
A large 81.8% case-level residual remains unexplained by analyzed features.
The residual is not reduced by HealthBench's metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0/300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485).
Disagreement follows an inverted-U with completion quality (AUC = 0.689).
Reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR = 2.55, p < 10^(-24)); irreducible uncertainty has no effect (OR = 1.01, p = 0.90).
Even reducible uncertainty explains only ~3% of total variance.
Conclusion: the agreement ceiling is largely structural, but closing information gaps in evaluation scenarios could lower disagreement, whereas inherent clinical ambiguity shows no comparable effect.
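An odds ratio multiplies odds, not probabilities, so OR = 2.55 does not mean the probability of disagreement is 2.55 times higher. The sketch below converts a baseline probability into the probability implied by an odds ratio. The baseline of 0.20 is a hypothetical value chosen for illustration and is not taken from the paper; `apply_odds_ratio` is likewise an illustrative name.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Return the probability obtained by scaling the baseline odds by odds_ratio."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    if odds_ratio <= 0.0:
        raise ValueError("odds_ratio must be positive")
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline disagreement probability of 0.20.
reducible = apply_odds_ratio(0.20, 2.55)    # OR reported for reducible uncertainty
irreducible = apply_odds_ratio(0.20, 1.01)  # OR reported for irreducible uncertainty
print(f"reducible: {reducible:.3f}, irreducible: {irreducible:.3f}")
```

Under that assumed baseline, the reducible-uncertainty odds ratio lifts the probability to roughly 0.39, which is less than double, while the irreducible one leaves it essentially unchanged.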

This review was generated by AI and reviewed by a human editor.