QUICK REVIEW

[Paper Review] Open Category Detection with PAC Guarantees

Si Liu, Risheek Garrepalli|arXiv (Cornell University)|Aug 1, 2018

Domain Adaptation and Few-Shot Learning|Cited by 41

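One-Sentence Summary

This paper studies open category detection with PAC-style guarantees in a setting with two training sets: a clean nominal set and a contaminated mixture set whose alien fraction has a known upper bound. It proposes a thresholding method on anomaly scores that achieves a user-specified alien detection rate, and it provides finite-sample guarantees together with an empirical evaluation.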

ABSTRACT

Open category detection is the problem of detecting "alien" test instances that belong to categories or classes that were not present in the training data. In many applications, reliably detecting such aliens is central to ensuring the safety and accuracy of test set predictions. Unfortunately, there are no algorithms that provide theoretical guarantees on their ability to detect aliens under general assumptions. Further, while there are algorithms for open category detection, there are few empirical results that directly report alien detection rates. Thus, there are significant theoretical and empirical gaps in our understanding of open category detection. In this paper, we take a step toward addressing this gap by studying a simple, but practically-relevant variant of open category detection. In our setting, we are provided with a "clean" training set that contains only the target categories of interest and an unlabeled "contaminated" training set that contains a fraction $α$ of alien examples. Under the assumption that we know an upper bound on $α$, we develop an algorithm with PAC-style guarantees on the alien detection rate, while aiming to minimize false alarms. Empirical results on synthetic and standard benchmark datasets demonstrate the regimes in which the algorithm can be effective and provide a baseline for further advancements.

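Motivation and Objectives

Treat open category detection as a safety-critical problem in which a guaranteed fraction of aliens must be detected.
Propose a simple setting with two training sets: clean nominal data plus contaminated data whose alien fraction α has a known upper bound.
Develop a PAC-style method that guarantees a user-specified alien detection rate while keeping false alarms low.
Provide finite-sample guarantees and show how the upper bound on α affects performance and data requirements.
Benchmark the method with anomaly detectors on synthetic data and standard datasets.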

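Proposed Method

Define anomaly score distributions for nominal data (F0), alien data (Fa), and mixture data (Fm), and derive Fa from Fm and F0 when α is known, using Fm = (1−α)F0 + αFa.
Compute empirical CDFs from S0 and Sm and form the empirical alien CDF as Fa_hat(x) = (Fm_hat(x) − (1−α)F0_hat(x)) / α.
Set the threshold τ̂_q to the largest score satisfying Fa_hat(τ̂_q) ≤ q, so that scores above τ̂_q are flagged and the alien recall is 1−q.
Monotonize and clip Fa_hat before thresholding so that it is a valid cumulative distribution function (CDF).
State a finite-sample guarantee (Theorem 1) giving the sample size n needed to reach the target recall 1−η as a function of ε and δ: n = O((1/(ε^2 α^2)) log(1/δ)).
Discuss when exact knowledge of α can be relaxed to an upper bound, under an admissible anomaly detector (F0(x) ≥ Fm(x) for all x, which is the direction consistent with flagging scores above the threshold), and how this affects the guarantee.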
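The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `ecdf` and `alien_threshold` are hypothetical, higher scores are assumed to mean "more anomalous", the candidate thresholds are taken to be the pooled scores, and monotonization is done with a running maximum, which is only one of several possible choices.

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at each point of `x`."""
    sample = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts the sample points that are <= x
    return np.searchsorted(sample, x, side="right") / sample.size

def alien_threshold(s0, sm, alpha, q):
    """Estimate the threshold tau_hat_q (hypothetical helper).

    s0    : anomaly scores of the clean nominal set S0
    sm    : anomaly scores of the contaminated mixture set Sm
    alpha : assumed alien fraction in Sm
    q     : tolerated missed-alien rate, so the target recall is 1 - q
    Assumption: higher score means more anomalous.
    """
    grid = np.sort(np.concatenate([s0, sm]))            # candidate thresholds
    f0 = ecdf(s0, grid)
    fm = ecdf(sm, grid)
    fa = (fm - (1.0 - alpha) * f0) / alpha              # Fa_hat from Fm_hat and F0_hat
    fa = np.clip(np.maximum.accumulate(fa), 0.0, 1.0)   # running max, then clip to [0, 1]
    ok = np.nonzero(fa <= q)[0]
    # If no candidate satisfies Fa_hat <= q, fall back to flagging everything.
    return grid[ok[-1]] if ok.size else -np.inf

# Synthetic usage: nominal scores ~ N(0, 1), alien scores ~ N(3, 1).
rng = np.random.default_rng(0)
n, alpha, q = 5000, 0.2, 0.05
s0 = rng.normal(0.0, 1.0, n)
is_alien = rng.random(n) < alpha
sm = np.where(is_alien, rng.normal(3.0, 1.0, n), rng.normal(0.0, 1.0, n))

tau = alien_threshold(s0, sm, alpha, q)
recall = np.mean(sm[is_alien] > tau)   # fraction of aliens flagged
fpr = np.mean(s0 > tau)                # fraction of nominals flagged
print(f"tau={tau:.3f} recall={recall:.3f} fpr={fpr:.3f}")
```

A test instance is flagged as alien when its score exceeds `tau`. The running maximum can only push Fa_hat upward, so in this sketch the threshold tends to err on the low side, which favors recall at the cost of extra false alarms.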

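Experimental Results

Research Questions

RQ1: Given two training sets and a known upper bound on the alien fraction α, can open category detection be achieved with PAC-style guarantees?
RQ2: How does the quality of the anomaly detector affect alien recall and the nominal false alarm rate with finite samples?
RQ3: How many samples are needed to guarantee a target alien detection rate, and how does overestimating α affect performance?
RQ4: How do the proposed guarantees hold up empirically on synthetic data and standard benchmark datasets as α and n vary?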

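Key Findings

When α is known or upper-bounded, the proposed thresholding method achieves the user-specified alien detection rate with finite-sample guarantees.
Recall improves as n and α increase, while the false positive rate (FPR) depends on the quality of the anomaly detector and on the domain; non-trivial FPR is observed on the benchmark data.
The required sample size grows in proportion to 1/(ε^2 α^2) and log(1/δ), so data requirements rise as α decreases.
On several UCI and vision datasets, empirical recall approaches 1−q for larger n, but gaps remain for smaller datasets or very small α.
Overestimating α worsens the FPR more than it improves recall, which highlights the importance of estimating α accurately.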
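To make the sample-size scaling concrete, the following sketch evaluates the quantity inside the O(·) bound. The helper `relative_sample_size` is hypothetical, and the constant factor is not stated in this summary, so only ratios between settings are meaningful, not the absolute values.

```python
import math

def relative_sample_size(eps, alpha, delta):
    """Quantity proportional to the required n, up to an unspecified constant."""
    return math.log(1.0 / delta) / (eps ** 2 * alpha ** 2)

base = relative_sample_size(eps=0.05, alpha=0.2, delta=0.05)
half_alpha = relative_sample_size(eps=0.05, alpha=0.1, delta=0.05)
half_eps = relative_sample_size(eps=0.025, alpha=0.2, delta=0.05)
squared_delta = relative_sample_size(eps=0.05, alpha=0.2, delta=0.05 ** 2)

print(half_alpha / base)     # effect of halving alpha
print(half_eps / base)       # effect of halving eps
print(squared_delta / base)  # effect of squaring delta
```

Under this scaling, halving α or ε multiplies the requirement by about 4, while squaring δ only doubles it, so a small alien fraction is far more costly in data than a stricter confidence level.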

This review was generated by AI and reviewed by a human editor.