QUICK REVIEW

[Paper Review] A Benchmark of Medical Out of Distribution Detection

Tianshi Cao, Chin‐Wei Huang|arXiv (Cornell University)|Jul 8, 2020

Topic: COVID-19 diagnosis using AI | References: 19 | Cited by: 39

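One-Sentence Summary

This paper benchmarks a range of OoD detection methods on three OoD use cases across three medical imaging domains (chest X-ray, fundus imaging, and histology), finding that a simple binary classifier on the feature representation often performs best and that samples close to the training distribution remain hard to detect.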

ABSTRACT

Motivation: Deep learning models deployed for use on medical tasks can be equipped with Out-of-Distribution Detection (OoDD) methods in order to avoid erroneous predictions. However it is unclear which OoDD method should be used in practice. Specific Problem: Systems trained for one particular domain of images cannot be expected to perform accurately on images of a different domain. These images should be flagged by an OoDD method prior to diagnosis. Our approach: This paper defines 3 categories of OoD examples and benchmarks popular OoDD methods in three domains of medical imaging: chest X-ray, fundus imaging, and histology slides. Results: Our experiments show that despite methods yielding good results on some categories of out-of-distribution samples, they fail to recognize images close to the training distribution. Conclusion: We find a simple binary classifier on the feature representation has the best accuracy and AUPRC on average. Users of diagnostic tools which employ these OoDD methods should still remain vigilant that images very close to the training distribution yet not in it could yield unexpected results.

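Motivation and Objectives

Define three OoD use cases in medical imaging and explain why robust OoD detection is needed.
Evaluate a broad set of OoD methods (data-only, classifier-only, and methods with auxiliary models) across several medical imaging domains.
Offer practical guidance on which OoD methods work, and highlight their limitations when inputs lie close to the training distribution.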

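Proposed Method

Define three OoD categories: unrelated inputs, incorrectly prepared inputs, and unseen conditions arising from bias in the training data.
Build experiments on four In datasets covering chest X-ray, fundus imaging, and histology, and set up Out data for each use case.
Evaluate three families of OoD methods: data-only (e.g., KNN), classifier-only (e.g., thresholding, SVM, binary classifier), and methods with auxiliary models (e.g., autoencoders, VAE, ALI/BiGAN).
Train the task network (DenseNet-121) on In data, train the OoD methods on a validation set that mixes In and Out samples, and test on a balanced In/Out test set.
Explore hyperparameters and run multiple trials to assess stability and generalization across Out data partitions.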
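The protocol above (fit an OoD detector on a mixed In/Out validation set, then score it on a balanced In/Out test set) can be illustrated with a minimal sketch. This is not the paper's code. The features are synthetic Gaussians standing in for DenseNet-121 activations, and `DIM`, `sample`, `fit_binary_detector`, and `run_use_case` are hypothetical names introduced for illustration. The detector is a plain logistic regression, used here as one possible form of a binary classifier on the feature representation.

```python
import math
import random

DIM = 4  # hypothetical feature size; real DenseNet-121 features are far larger


def sample(n, center, rng):
    """Draw n synthetic feature vectors from an isotropic Gaussian around `center`."""
    return [[rng.gauss(center, 1.0) for _ in range(DIM)] for _ in range(n)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(w, b, x):
    """Linear score of feature vector x; positive means 'Out'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def fit_binary_detector(xs, ys, epochs=300, lr=0.1):
    """Logistic regression via batch gradient descent; label 1 = Out, 0 = In."""
    w, b, n = [0.0] * DIM, 0.0, len(xs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(logit(w, b, x)) - y
            for j in range(DIM):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of samples whose predicted In/Out label matches the true label."""
    correct = sum((logit(w, b, x) >= 0.0) == (y == 1) for x, y in zip(xs, ys))
    return correct / len(xs)


def run_use_case(out_center, seed=0, n_val=100, n_test=200):
    """Fit on a mixed In/Out validation set, evaluate on a balanced In/Out test set."""
    rng = random.Random(seed)
    val_x = sample(n_val, 0.0, rng) + sample(n_val, out_center, rng)
    val_y = [0] * n_val + [1] * n_val
    test_x = sample(n_test, 0.0, rng) + sample(n_test, out_center, rng)
    test_y = [0] * n_test + [1] * n_test
    w, b = fit_binary_detector(val_x, val_y)
    return accuracy(w, b, test_x, test_y)


if __name__ == "__main__":
    # Assumed offsets: 3.0 mimics clearly unrelated inputs, 0.3 mimics near-distribution inputs.
    print("far Out accuracy: ", run_use_case(3.0))
    print("near Out accuracy:", run_use_case(0.3))
```

In this toy setting the detector separates the distant Out cluster easily but does much worse when the Out cluster nearly overlaps the In cluster, which mirrors the qualitative pattern the paper reports for samples close to the training distribution.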

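Experimental Results

Research Questions

RQ1: Which OoD methods best separate In from Out samples across medical imaging domains?
RQ2: In medical OoD tasks, can simple classifier-based OoD detection match or exceed methods that use auxiliary models?
RQ3: How does OoD performance change across use cases involving unrelated data, incorrect preparation, and unseen diseases?
RQ4: How does using multiple Out datasets affect the generalization of OoD detectors?
RQ5: What are the practical trade-offs of OoD methods in setup and run time within a clinical workflow?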

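Key Findings

Classifier-only methods, in particular the binary classifier and Mahalanobis, achieve strong overall accuracy and AUPRC and often outperform methods with auxiliary models.
Detection performance drops markedly on use case 3 (unseen diseases), with all methods close to chance in some evaluations.
Using multiple Out datasets for D_val_Out improves decision-boundary stability and generalization, and raises performance for some methods.
The KNN-based data-only methods offer a favorable setup/run-time trade-off, but can be memory-intensive because they must store the training data.
Auxiliary-model methods such as autoencoders do not consistently outperform simpler classifier-based methods across domains; fundus imaging is a notable exception.
Across all evaluations, many OoD methods still struggle to detect samples very close to the training distribution.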
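To make two terms from these findings concrete, the sketch below shows a KNN distance score and an AUPRC estimate. It is an illustration under stated assumptions, not the paper's implementation: `knn_score` and `average_precision` are hypothetical helper names, the 2-D points are made up, Out is treated as the positive class, and AUPRC is approximated by average precision, a common estimator.

```python
import math


def knn_score(x, train_features, k=2):
    """Mean Euclidean distance from x to its k nearest stored In features.

    Larger scores mean x looks more Out-like. Every training feature is kept
    in memory, which is the storage cost noted for KNN-style methods.
    """
    dists = sorted(math.dist(x, t) for t in train_features)
    return sum(dists[:k]) / k


def average_precision(scores, labels):
    """Average precision with label 1 = Out (positive) and 0 = In."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


# Made-up In features (corners of the unit square) and four test points.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
test_points = [(0.5, 0.5), (0.2, 0.8), (5.0, 5.0), (4.0, 0.0)]
test_labels = [0, 0, 1, 1]  # the last two points are treated as Out

scores = [knn_score(p, train) for p in test_points]
ap = average_precision(scores, test_labels)
print(ap)
```

Here both Out points lie farther from the stored In features than either In point, so they rank first and the average precision is 1.0. Out points that sit inside the In cloud would rank lower and pull the value down.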

This review was generated by AI and reviewed by human editors.