QUICK REVIEW

[Paper Review] Unsupervised anomaly detection algorithms on real-world data: how many do we need?

Roel Bouman, Zaharah Bukhsh|arXiv (Cornell University)|May 1, 2023

Anomaly Detection Techniques and Applications | Cited by 12

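ONE-SENTENCE SUMMARY

This paper benchmarks 32 unsupervised anomaly detection algorithms on 52 real-world multivariate datasets and finds that kNN dominates on local anomalies, EIF dominates on global anomalies, and a three-algorithm toolbox suffices overall.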

ABSTRACT

In this study we evaluate 32 unsupervised anomaly detection algorithms on 52 real-world multivariate tabular datasets, performing the largest comparison of unsupervised anomaly detection algorithms to date. On this collection of datasets, the $k$-thNN (distance to the $k$-nearest neighbor) algorithm significantly outperforms most other algorithms. Visualizing and then clustering the relative performance of the considered algorithms on all datasets, we identify two clear clusters: one with "local" datasets, and another with "global" datasets. "Local" anomalies occupy a region with low density when compared to nearby samples, while "global" anomalies occupy an overall low density region in the feature space. On the local datasets the $k$NN ($k$-nearest neighbor) algorithm comes out on top. On the global datasets, the EIF (extended isolation forest) algorithm performs the best. Also taking into consideration the algorithms' computational complexity, a toolbox with these three unsupervised anomaly detection algorithms suffices for finding anomalies in this representative collection of multivariate datasets. By providing access to code and datasets, our study can be easily reproduced and extended with more algorithms and/or datasets.

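RESEARCH MOTIVATION AND OBJECTIVES

Evaluate the performance of a large number of unsupervised anomaly detection algorithms on real-world multivariate data.
Determine whether the dataset type (local versus global anomalies) affects algorithm performance.
Provide practical guidance on a compact, efficient toolbox for anomaly detection in real-world settings.
Assess computational considerations to balance accuracy and efficiency.
Ensure reproducibility by sharing code and datasets.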

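PROPOSED METHOD

Evaluate 32 anomaly detection algorithms, mostly from PyOD, on 52 real-world multivariate datasets.
For each dataset, run each algorithm over a range of sensible hyperparameter settings and average the resulting ROC-AUC scores.
Preprocess the data by removing duplicates, centering, and scaling by the interquartile range, which reduces sensitivity to anomalies.
Use ROC-AUC as the primary evaluation metric and compute a ranking of algorithm performance on each dataset.
Apply the Iman-Davenport test to detect overall differences, followed by the Nemenyi post-hoc test to determine pairwise significance.
Provide a public GitHub repository containing the code and data for full reproducibility.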
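
The preprocessing and scoring steps listed above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' code: the function names, the toy data, and the hyperparameter grid `(1, 2, 3)` are all hypothetical, centering on the median is an assumption (the summary only says "centering"), and the kth-NN detector is written by hand rather than taken from PyOD.

```python
import math
from statistics import mean, median, quantiles


def preprocess(rows, labels):
    """Drop duplicate rows, then center and IQR-scale each column."""
    # Order-preserving de-duplication; a duplicated row keeps its first label.
    unique = {}
    for row, label in zip(rows, labels):
        unique.setdefault(tuple(row), label)
    columns = []
    for col in zip(*unique):
        q1, _, q3 = quantiles(col, n=4, method="inclusive")
        iqr = (q3 - q1) or 1.0  # assumption: constant columns are left unscaled
        center = median(col)    # assumption: "centering" subtracts the median
        columns.append([(v - center) / iqr for v in col])
    return [list(r) for r in zip(*columns)], list(unique.values())


def kth_nn_scores(X, k):
    """Anomaly score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, x in enumerate(X):
        dists = sorted(math.dist(x, other) for j, other in enumerate(X) if j != i)
        scores.append(dists[k - 1])
    return scores


def roc_auc(labels, scores):
    """ROC-AUC as the chance an anomaly outscores a normal point (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): five clustered points, one duplicate, one distant anomaly.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (1, 1), (10, 10)]
labels = [0, 0, 0, 0, 0, 0, 1]

X, y = preprocess(rows, labels)
# Average ROC-AUC over a small hyperparameter grid instead of tuning k.
mean_auc = mean(roc_auc(y, kth_nn_scores(X, k)) for k in (1, 2, 3))
print(len(X), mean_auc)
```

On this toy input the duplicate row is dropped, leaving six points, and the distant point receives the highest score for every `k`, so the averaged ROC-AUC is 1.0. For real data, library routines such as scikit-learn's `RobustScaler` and `roc_auc_score` would be the more usual choice; the brute-force distance loop here is quadratic and only meant to make the steps visible.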

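EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Which unsupervised anomaly detection algorithms perform best on real-world multivariate tabular data?
RQ2: Does algorithm performance differ depending on whether a dataset exhibits local or global anomalies?
RQ3: Can a small, practical toolbox of algorithms effectively identify anomalies across a representative collection of datasets?
RQ4: How do factors such as computational complexity affect the practical choice of algorithm?
RQ5: Without hyperparameter optimization, what generalizable guidelines exist for unsupervised anomaly detection?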

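KEY FINDINGS

Many algorithms perform comparably, with median performance across datasets at roughly 90% of the best.
kth-NN and kNN variants consistently outperform most algorithms and often dominate, especially on local-anomaly datasets.
Extended Isolation Forest (EIF) performs best on global-anomaly datasets.
CBLOF is consistently outperformed by other algorithms and performs worst overall.
Neural-network-based methods (DeepSVDD, ALAD, SO-GAAL) tend to perform poorly on real-world tabular data, owing to their design and hyperparameter sensitivity.
Two dataset clusters emerge: a local-anomaly cluster, where local methods excel, and a global-anomaly cluster, where a broader set of methods performs best.
A three-algorithm toolbox consisting of kth-NN (or the kNN family), EIF, and one robust global detector suffices to cover the datasets considered, balancing accuracy and efficiency.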
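
Statements such as "consistently outperforms" rest on per-dataset ranks compared with the Iman-Davenport test. A minimal sketch of that statistic follows, using the textbook Friedman and Iman-Davenport formulas. The AUC table is made up purely for illustration (algorithms "A", "B", "C" are placeholders, not results from the paper), and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values so the highest gets rank 1; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def iman_davenport(auc_table):
    """Return mean ranks, Friedman chi-square, Iman-Davenport F, and its degrees of freedom.

    auc_table holds one row per dataset and one column per algorithm.
    Note: F is undefined (division by zero) when every dataset ranks the
    algorithms identically; this sketch does not handle that case.
    """
    n, k = len(auc_table), len(auc_table[0])
    rank_rows = [average_ranks(row) for row in auc_table]
    mean_ranks = [sum(col) / n for col in zip(*rank_rows)]
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return mean_ranks, chi2, f_stat, (k - 1, (k - 1) * (n - 1))


# Hypothetical ROC-AUC values: 4 datasets (rows) x 3 algorithms A, B, C (columns).
auc_table = [
    [0.90, 0.80, 0.70],
    [0.95, 0.85, 0.60],
    [0.80, 0.90, 0.70],
    [0.85, 0.75, 0.65],
]
mean_ranks, chi2, f_stat, dof = iman_davenport(auc_table)
print(mean_ranks, chi2, f_stat, dof)
```

The F statistic is compared against an F distribution with the returned degrees of freedom; if the null hypothesis of equal performance is rejected, a post-hoc test such as Nemenyi is used for the pairwise comparisons. Computing the p-value itself is left out here to keep the sketch dependency-free.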


This review was generated by AI and reviewed by human editors.