QUICK REVIEW

[Paper Digest] In or Out? Fixing ImageNet Out-of-Distribution Detection Evaluation

Julian Bitterwolf, Maximilian A. Müller|arXiv (Cornell University)|Jun 1, 2023

Adversarial Robustness in Machine Learning | Cited by 11

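One-Sentence Summary

This paper shows that many OOD test datasets used with ImageNet-1K contain ID contamination, introduces NINCO (No ImageNet Class Objects), a test set of 64 OOD classes and 5,879 cleaned images, and evaluates a broad range of OOD detectors across many architectures, highlighting the effect of pretraining and the need for per-class evaluation and OOD unit tests.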

ABSTRACT

Out-of-distribution (OOD) detection is the problem of identifying inputs which are unrelated to the in-distribution task. The OOD detection performance when the in-distribution (ID) is ImageNet-1K is commonly being tested on a small range of test OOD datasets. We find that most of the currently used test OOD datasets, including datasets from the open set recognition (OSR) literature, have severe issues: In some cases more than 50% of the dataset contains objects belonging to one of the ID classes. These erroneous samples heavily distort the evaluation of OOD detectors. As a solution, we introduce with NINCO a novel test OOD dataset, each sample checked to be ID free, which with its fine-grained range of OOD classes allows for a detailed analysis of an OOD detector's strengths and failure modes, particularly when paired with a number of synthetic "OOD unit-tests". We provide detailed evaluations across a large set of architectures and OOD detection methods on NINCO and the unit-tests, revealing new insights about model weaknesses and the effects of pretraining on OOD detection performance. We provide code and data at https://github.com/j-cb/NINCO.

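Motivation and Goals

Identify and quantify the ID contamination present in widely used ImageNet-1K OOD test datasets.
Propose a clean, challenging OOD test set (NINCO) with per-class evaluation, to better understand detector weaknesses.
Analyze how different OOD detection methods perform across a range of architectures and pretraining schemes.
Introduce OOD unit tests to probe detector weaknesses beyond natural images.
Offer recommendations for fair evaluation and reporting of OOD detectors.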

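Proposed Method

For each commonly used OOD dataset, manually inspect a random sample of 400 images to measure ID contamination.
Build NINCO, covering 64 OOD classes and 5,879 images manually verified to be ID-free, plus 17 synthetic OOD unit tests.
Evaluate 11 OOD detection methods across multiple architectures (ViTs, convolutional networks) and pretraining schemes (IN-21K, CLIP, JFT, etc.).
Analyze the MSP baseline, feature-based detectors (Maha, RMaha, ViM), and other methods (MaxLogit, Energy, KL-Matching, KNN, ReAct, etc.), including their use of pre-logit features.
Assess how pretraining affects OOD detection performance, and how reliable aggregate metrics are compared with per-class metrics.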
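
To make the logit-based detectors named above concrete, here is a minimal sketch of three of them (MSP, MaxLogit, Energy) as they are commonly defined in the OOD literature. It is not the paper's code: the function names, the sign convention (higher score means "more in-distribution"), and the example logits are assumptions chosen for illustration.

```python
import math

def logsumexp(logits):
    """Numerically stable log(sum(exp(z))) over a list of logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def msp_score(logits):
    """Maximum softmax probability; higher means more ID-like."""
    return math.exp(max(logits) - logsumexp(logits))

def max_logit_score(logits):
    """Largest raw logit; higher means more ID-like."""
    return max(logits)

def energy_score(logits, temperature=1.0):
    """Negative free energy, T * logsumexp(z / T); higher means more ID-like."""
    return temperature * logsumexp([z / temperature for z in logits])

# Hypothetical logits for a 3-class classifier (illustration only).
confident = [9.0, 1.0, 0.5]   # one class clearly dominates
uncertain = [1.1, 1.0, 0.9]   # nearly flat prediction

for name, logits in [("confident", confident), ("uncertain", uncertain)]:
    print(name, msp_score(logits), max_logit_score(logits), energy_score(logits))
```

All three scores rank the peaked prediction above the flat one. Feature-based detectors such as Maha, RMaha, and ViM additionally need pre-logit features and training-set statistics, so they are not covered by this sketch.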

Figure 3: OOD-detection before and after removing samples with ID-objects: We show FPR (lower is better) of two OOD detectors (MSP and Mahalanobis distance) for a ViT, evaluated on cleaned and full subsets of four popular OOD datasets.
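
The effect shown in Figure 3 can be reproduced on toy numbers. The sketch below assumes the common convention of reporting FPR at 95% TPR, with ID as the positive class and higher scores meaning "more ID-like"; the helper name `fpr_at_tpr` and all scores are hypothetical and not taken from the paper.

```python
import math

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID at the threshold
    that still accepts a `tpr` fraction of the ID samples."""
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted (epsilon guards float round-off).
    k = math.ceil(tpr * len(ranked) - 1e-9)
    threshold = ranked[k - 1]
    accepted = sum(1 for s in ood_scores if s >= threshold)
    return accepted / len(ood_scores)

# Hypothetical detector scores (illustration only).
id_scores = [0.60 + 0.02 * i for i in range(20)]
clean_ood = [0.10, 0.20, 0.30, 0.40, 0.50, 0.55, 0.58, 0.70]
# Two "OOD" images that actually contain ID objects and therefore score high.
contaminated_ood = clean_ood + [0.90, 0.95]

print("FPR on cleaned set:", fpr_at_tpr(id_scores, clean_ood))
print("FPR on full set:   ", fpr_at_tpr(id_scores, contaminated_ood))
```

On the full set the detector is charged with false positives for images it arguably classified correctly, which is why the measured FPR drops once those samples are removed.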

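Experimental Results

Research Questions

RQ1: To what extent are existing ImageNet-1K OOD test datasets contaminated with in-distribution objects?
RQ2: Does a clean, ID-free OOD test set (NINCO) give a more reliable evaluation of OOD detectors across architectures?
RQ3: How do the type of pretraining and the use of features affect OOD detection performance?
RQ4: Do synthetic OOD unit tests reveal weaknesses that natural-image datasets do not expose?
RQ5: Which evaluation practices (per-class breakdowns, unit tests) should be adopted to benchmark OOD detectors fairly?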
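
RQ1 rests on a sampling estimate: a random sample of 400 images per dataset is inspected by hand, and the share containing ID objects estimates the contamination rate of the whole dataset. As a rough sketch of the uncertainty such an estimate carries, the code below computes a Wilson score interval. The interval method is an assumption added here, not something the digest attributes to the paper, and the count of 220 contaminated images is a made-up example.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical inspection result: 220 of 400 sampled images contain ID objects.
hits, n = 220, 400
low, high = wilson_interval(hits, n)
print(f"estimated contamination: {hits / n:.3f}, interval: [{low:.3f}, {high:.3f}]")
```

With 400 samples the interval spans roughly ten percentage points near a 50% rate, which is narrow enough to separate heavily contaminated datasets from mostly clean ones.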

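Key Findings

Many widely used OOD datasets for IN-1K contain substantial ID contamination, often exceeding 50%, for example in the Places and Species datasets.
ID contamination can unfairly penalize strong detectors and inflate false positives, because a detector may correctly recognize ID content that should not be counted as OOD in the first place.
NINCO provides 64 manually verified OOD classes with 5,879 ID-free images, enabling a detailed analysis of detector strengths and failure modes, and comes with 17 synthetic unit tests to probe weaknesses.
Pretraining on larger datasets generally improves OOD detection, and detectors based on pre-logit features often outperform MSP, but the gains depend strongly on the model and the pretraining.
Explicit use of pre-logit features (feature-based detectors) yields more consistent improvements across models; however, zero-shot CLIP-based methods do not outperform IN-1K classifiers on NINCO.
On NINCO, the mean FPR improvement from advanced detectors is more pronounced than on some traditional benchmarks, and per-class analysis shows that performance varies widely across OOD classes.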
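
The per-class analysis mentioned in the last finding can be sketched as follows: compute one FPR per OOD class at a fixed threshold, then summarize with the mean over classes and with the fraction of classes whose FPR is at most a given value (the quantity plotted in Figure 5). The class names, scores, and threshold are placeholders, and higher scores are assumed to mean "more ID-like".

```python
def per_class_fpr(scores_by_class, threshold):
    """FPR per OOD class: share of that class's samples accepted as ID."""
    return {
        name: sum(1 for s in scores if s >= threshold) / len(scores)
        for name, scores in scores_by_class.items()
    }

def mean_fpr(fpr_by_class):
    """Unweighted mean of the per-class FPR values."""
    return sum(fpr_by_class.values()) / len(fpr_by_class)

def fraction_of_classes_at_most(fpr_by_class, x):
    """Share of classes whose FPR is at most x (one point of the CDF)."""
    return sum(1 for v in fpr_by_class.values() if v <= x) / len(fpr_by_class)

# Placeholder classes and detector scores (illustration only).
scores_by_class = {
    "class_a": [0.1, 0.2, 0.3, 0.4],   # easy: nothing passes the threshold
    "class_b": [0.2, 0.6, 0.7, 0.1],   # mixed
    "class_c": [0.9, 0.8, 0.7, 0.6],   # hard: everything passes
}
threshold = 0.5

fprs = per_class_fpr(scores_by_class, threshold)
print(fprs)
print("mean FPR:", mean_fpr(fprs))
print("classes with FPR <= 0.5:", fraction_of_classes_at_most(fprs, 0.5))
```

A single aggregate number hides the spread in this toy case: the mean says the detector fails half the time, while the per-class view shows one class that is handled perfectly and one that is missed entirely.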

Figure 5: Cumulative distribution of the % of NINCO-classes for which an FPR at least as low as a given x-value is achieved. The area over this curve corresponds to the mean FPR. The further in the top left corner, the better. The best methods explicitly access pre-logit features (Left): Different O

This digest was generated by AI and reviewed by a human editor.