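[Paper Digest] Decomposing Physician Disagreement in HealthBench
This paper analyzes the variance of physician disagreement in HealthBench. Rubric identity explains part of the label variance but very little of the disagreement variance, and most disagreement is driven by a large case-level residual. Reducible uncertainty more than doubles the odds of disagreement, which points to concrete improvements in evaluation design.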
We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it. Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance; physician identity accounts for just 2.4%. The dominant 81.8% case-level residual is not reduced by HealthBench's metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0/300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485). Disagreement follows an inverted-U with completion quality (AUC = 0.689), confirming physicians agree on clearly good or bad outputs but split on borderline cases. Physician-validated uncertainty categories reveal that reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR = 2.55, p < 10^(-24)), while irreducible uncertainty (genuine medical ambiguity) has no effect (OR = 1.01, p = 0.90), though even the former explains only ~3% of total variance. The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity does not, pointing toward actionable evaluation design improvements.
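One way to build intuition for the inverted-U pattern is a toy model, not taken from the paper: assume two physicians label a criterion independently, each marking it "met" with the same probability `p_met` (a stand-in for completion quality). Under that assumption the chance they disagree is 2·p·(1−p), which is small for clearly good or clearly bad outputs and largest for borderline ones. The function name and the independence assumption are illustrative only.

```python
def pairwise_disagreement(p_met: float) -> float:
    """Probability that two raters give different met/not-met labels.

    Assumes (for illustration only) that both raters label independently
    and each marks the criterion 'met' with probability p_met.
    """
    if not 0.0 <= p_met <= 1.0:
        raise ValueError("p_met must be in [0, 1]")
    # One rater says 'met' and the other 'not met', in either order.
    return 2.0 * p_met * (1.0 - p_met)


if __name__ == "__main__":
    # Sweep from clearly bad (low p_met) to clearly good (high p_met).
    for p in (0.05, 0.25, 0.50, 0.75, 0.95):
        print(f"p_met={p:.2f}  disagreement={pairwise_disagreement(p):.3f}")
```

The curve is symmetric around `p_met = 0.5`, where it peaks, which mirrors the qualitative shape reported in the abstract without reproducing the paper's actual analysis.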
Research Motivation and Goals
- Understand where variance in physician disagreement arises in HealthBench.
- Quantify contributions of rubric identity, physician identity, and metadata to disagreement variance.
- Assess whether observable features explain disagreement and identify actionable evaluation-design improvements.
Proposed Method
- Decompose variance in HealthBench disagreement across rubric identity, physician identity, and case-level factors.
- Test whether observable features predict disagreement: HealthBench metadata labels, normative rubric language (pseudo R^2), medical specialty (Tukey pairwise comparisons), surface-feature triage (AUC), and embeddings (AUC).
- Model physician-validated uncertainty categories to separate reducible vs irreducible uncertainty and their effects on disagreement odds.
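The idea of attributing a share of variance to a grouping factor can be sketched as follows. This digest does not say which estimator the paper uses, so the example below substitutes a plain one-way eta-squared (between-group sum of squares over total sum of squares) on invented toy data. The function name `variance_share` and all data values are hypothetical.

```python
from collections import defaultdict


def variance_share(values, groups):
    """Fraction of the variance in `values` explained by `groups`.

    This is one-way eta-squared: between-group sum of squares divided by
    total sum of squares. It is a simplification chosen for illustration,
    not necessarily the decomposition used in the paper.
    """
    n = len(values)
    grand_mean = sum(values) / n
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # no variance to explain

    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)

    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return ss_between / ss_total


# Toy data, invented for illustration: 1 = the item drew a disagreement.
disagree = [1, 1, 1, 0, 1, 0, 0, 0]
rubric = ["a", "a", "a", "a", "b", "b", "b", "b"]
physician = ["p1", "p2", "p1", "p2", "p1", "p2", "p1", "p2"]

if __name__ == "__main__":
    print("rubric share:   ", variance_share(disagree, rubric))
    print("physician share:", variance_share(disagree, physician))
```

Whatever is left after the named factors are accounted for plays the role of the case-level residual in this sketch. A real analysis with crossed raters and rubrics would more likely use a mixed-effects model.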
Experimental Results
Research Questions
- RQ1: What portion of disagreement variance is explained by rubric identity, physician identity, and metadata?
- RQ2: Do observable features (metadata, language, specialty, surface features, embeddings) predict disagreement?
- RQ3: Is disagreement higher for borderline cases, and how does completion quality affect agreement?
- RQ4: How do reducible and irreducible uncertainties impact disagreement odds?
Key Findings
- Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance.
- Physician identity accounts for 2.4% of disagreement variance.
- A large 81.8% case-level residual remains unexplained by analyzed features.
- None of the tested features reduces the residual: HealthBench's metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0/300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485).
- Disagreement follows an inverted-U with completion quality (AUC = 0.689): physicians agree on clearly good or bad outputs but split on borderline cases.
- Reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR = 2.55, p < 10^(-24)); irreducible uncertainty has no effect (OR = 1.01, p = 0.90).
- Even reducible uncertainty explains only ~3% of total variance.
- Conclusion: the agreement ceiling is largely structural, but because reducible uncertainty raises disagreement while inherent clinical ambiguity does not, closing information gaps in evaluation scenarios could lower disagreement.
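To make the odds-ratio findings concrete, the sketch below computes an odds ratio and a Wald 95% confidence interval from a 2x2 table. The paper's ORs presumably come from a fitted model, which this digest does not describe, so this is only the textbook 2x2 calculation. The counts are invented and do not come from HealthBench.

```python
import math


def odds_ratio(a: int, b: int, c: int, d: int):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: flagged cases with disagreement     b: flagged cases with agreement
    c: unflagged cases with disagreement   d: unflagged cases with agreement
    ('flagged' = tagged with the uncertainty category of interest)
    All four counts must be positive.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cell counts must be positive")
    or_value = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - 1.96 * se_log)
    upper = math.exp(math.log(or_value) + 1.96 * se_log)
    return or_value, lower, upper


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the calculation.
    or_value, lower, upper = odds_ratio(a=60, b=40, c=30, d=70)
    print(f"OR = {or_value:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An OR near 1, as reported for irreducible uncertainty, means flagged and unflagged cases have roughly the same odds of disagreement. An OR of 2.55 means the odds are about two and a half times higher for flagged cases.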
This digest was generated by AI and reviewed by a human editor.