QUICK REVIEW

[Paper Review] Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning

Zhuofan Xie, Zishan Lin | arXiv (Cornell University) | Feb 21, 2026

Topic: COVID-19 diagnosis using AI | Cited by: 0

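One-Sentence Summary

SaE recasts VLM text-image similarities as Dirichlet evidence to calibrate uncertainty, enabling interpretable and label-efficient active learning in medical imaging. With a 20% label budget across ten datasets, it reaches state-of-the-art macro-averaged accuracy (82.57%) and shows strong calibration on BTMRI (NLL 0.425).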

ABSTRACT

Active Learning (AL) reduces annotation costs in medical imaging by selecting only the most informative samples for labeling, but suffers from cold-start when labeled data are scarce. Vision-Language Models (VLMs) address the cold-start problem via zero-shot predictions, yet their temperature-scaled softmax outputs treat text-image similarities as deterministic scores while ignoring inherent uncertainty, leading to overconfidence. This overconfidence misleads sample selection, wasting annotation budgets on uninformative cases. To overcome these limitations, the Similarity-as-Evidence (SaE) framework calibrates text-image similarities by introducing a Similarity Evidence Head (SEH), which reinterprets the similarity vector as evidence and parameterizes a Dirichlet distribution over labels. In contrast to a standard softmax that enforces confident predictions even under weak signals, the Dirichlet formulation explicitly quantifies lack of evidence (vacuity) and conflicting evidence (dissonance), thereby mitigating overconfidence caused by rigid softmax normalization. Building on this, SaE employs a dual-factor acquisition strategy: high-vacuity samples (e.g., rare diseases) are prioritized in early rounds to ensure coverage, while high-dissonance samples (e.g., ambiguous diagnoses) are prioritized later to refine boundaries, providing clinically interpretable selection rationales. Experiments on ten public medical imaging datasets with a 20% label budget show that SaE attains state-of-the-art macro-averaged accuracy of 82.57%. On the representative BTMRI dataset, SaE also achieves superior calibration, with a negative log-likelihood (NLL) of 0.425.
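
The overconfidence described in the abstract can be illustrated with a small sketch. It assumes a temperature of 0.01, a value commonly seen in CLIP-style models; the setting used in the paper is not stated here. The similarity values are made up for illustration, and `zero_shot_probs` is a hypothetical helper name.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(similarities, temperature=0.01):
    """Temperature-scaled softmax over text-image cosine similarities.

    The temperature value is an assumption, not taken from the paper.
    """
    return softmax([s / temperature for s in similarities])

# Illustrative (made-up) cosine similarities for three class prompts.
# The gap between the best and worst class is only 0.05.
sims = [0.25, 0.22, 0.20]
probs = zero_shot_probs(sims)

# The top class ends up with about 0.95 probability despite the weak signal.
print([round(p, 4) for p in probs])
```

Because softmax forces the outputs to sum to one, it has no way to express "little evidence for any class", which is the gap the Dirichlet formulation is meant to fill.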

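Motivation and Objectives

Address the cold-start and overconfidence problems in VLM-driven medical active learning.
Provide calibrated, interpretable uncertainty signals through evidential reasoning.
Develop a dual-factor acquisition strategy that uses Vacuity and Dissonance for sample selection.
Enrich the medical semantic space with PubMed-augmented prompts.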

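Proposed Method

Introduces a Similarity Evidence Head (SEH) that maps the similarity vector to Dirichlet evidence with a positive strength λ.
Uses PubMed-augmented prompts to build semantically rich class prototypes for VLM similarity computation.
Trains the SEH with a dual-objective loss that balances classification performance against evidence calibration (Eq. 3).
Converts similarity-based evidence into Dirichlet parameters alpha_k(x) via alpha_k = λ * p_k + 1 (Eq. 4).
Decomposes the evidence into Vacuity and Dissonance to guide acquisition (Eqs. 5-6).
Applies a dual-factor active learning score (Eq. 7) with a linear schedule (Eq. 8) that favors high-Vacuity samples in early rounds and high-Dissonance samples in later rounds.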
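
The pipeline above can be sketched in a few lines of Python. Only alpha_k = λ * p_k + 1 comes from this summary. The vacuity and dissonance formulas below are the standard subjective-logic definitions, and the convex combination with a linearly decaying weight is one plausible reading of the dual-factor score and its schedule. All of these are assumptions, since Eqs. 5-8 are not reproduced here, and every function name is hypothetical.

```python
def dirichlet_params(p, lam):
    """alpha_k = lam * p_k + 1, as stated in the summary (Eq. 4)."""
    return [lam * pk + 1.0 for pk in p]

def vacuity(alpha):
    """K / S: close to 1 when total evidence is small (assumed standard form)."""
    return len(alpha) / sum(alpha)

def balance(bj, bk):
    """Relative balance of two belief masses; 1 when they are equal."""
    if bj + bk == 0.0:
        return 0.0
    return 1.0 - abs(bj - bk) / (bj + bk)

def dissonance(alpha):
    """Subjective-logic dissonance: high when strong beliefs conflict (assumed form)."""
    strength = sum(alpha)
    beliefs = [(a - 1.0) / strength for a in alpha]
    total = 0.0
    for k, bk in enumerate(beliefs):
        others = [bj for j, bj in enumerate(beliefs) if j != k]
        denom = sum(others)
        if bk == 0.0 or denom == 0.0:
            continue
        total += bk * sum(bj * balance(bj, bk) for bj in others) / denom
    return total

def schedule_weight(round_idx, num_rounds):
    """Vacuity weight decaying linearly from 1 to 0 over the rounds (assumption)."""
    if num_rounds <= 1:
        return 1.0
    return 1.0 - round_idx / (num_rounds - 1)

def acquisition_score(p, lam, round_idx, num_rounds):
    """Weighted mix of vacuity and dissonance; a higher score means label sooner."""
    alpha = dirichlet_params(p, lam)
    w = schedule_weight(round_idx, num_rounds)
    return w * vacuity(alpha) + (1.0 - w) * dissonance(alpha)

# Weak evidence (small lam) scores high early; conflicting evidence scores high late.
early = acquisition_score([0.5, 0.5], lam=0.5, round_idx=0, num_rounds=5)
late = acquisition_score([0.5, 0.5], lam=8.0, round_idx=4, num_rounds=5)
print(early, late)
```

Under these definitions, a sample with no evidence has vacuity 1 and dissonance 0, while a sample with strong but evenly split evidence has low vacuity and high dissonance, which matches the early-coverage and late-refinement roles described above.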

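Experimental Results

Research Questions

RQ1: Can similarity-based evidence from a frozen VLM be calibrated into a Dirichlet distribution that reflects uncertainty in medical active learning?
RQ2: Do Vacuity and Dissonance provide clinically meaningful, interpretable cues for sample selection?
RQ3: Does the dual-factor acquisition strategy improve label efficiency over existing medical imaging active learning baselines?
RQ4: Do PubMed-augmented prompts improve the VLM's semantic alignment with medical concepts for active learning?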

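Key Findings

SaE reaches a macro-averaged accuracy of 82.57% across ten datasets at a 20% label budget, outperforming the baselines.
On BTMRI at a 20% budget, SaE shows better calibration, with an NLL of 0.425 and an ECE of 0.021, indicating well-calibrated uncertainty.
Ablation studies show that the SEH is critical to performance; dual-factor scoring and VLM similarity each contribute notable gains.
SaE converges quickly in early rounds, improving sample efficiency and mitigating the cold-start problem.
Across multiple organs and datasets, the experiments show consistent improvements over Random, PCB, MedCoOp-based methods, and BiomedCoOp.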
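
For readers unfamiliar with the two calibration metrics reported above, the following is a minimal sketch of how NLL and ECE are commonly computed. The paper's exact evaluation protocol (number of bins, bin edges) is not given in this summary, so the 10 equal-width bins used here are an assumption, and the toy inputs are made up.

```python
import math

def nll(probs, labels):
    """Mean negative log-likelihood of the true class (lower is better)."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    The bin count is an assumption; a confidence of exactly 1.0 goes in the last bin.
    """
    n = len(labels)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    total = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(a for _, a in b) / len(b)
        total += (len(b) / n) * abs(accuracy - avg_conf)
    return total

# Toy example: two predictions at 0.9 confidence, only one of them correct.
toy_probs = [[0.9, 0.1], [0.9, 0.1]]
toy_labels = [0, 1]
print(nll(toy_probs, toy_labels), ece(toy_probs, toy_labels))
```

In the toy example the model is 90% confident but only 50% accurate, so the gap between confidence and accuracy shows up directly in the ECE value.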


This review was generated by AI and checked by a human editor.