QUICK REVIEW

[Paper Review] SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection

Yue Zhao, Xiyang Hu | arXiv (Cornell University) | Mar 11, 2020

Anomaly Detection Techniques and Applications | References: 47 | Cited by: 31

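One-Sentence Summary

SUOD is an open-source, modular acceleration framework for large-scale, heterogeneous unsupervised outlier detection that combines data dimensionality reduction, model approximation, and balanced distributed scheduling.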

ABSTRACT

Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples with numerous high-stake applications including fraud detection and intrusion detection. Due to the lack of ground truth labels, practitioners often have to build a large number of unsupervised, heterogeneous models (i.e., different algorithms with varying hyperparameters) for further combination and analysis, rather than relying on a single model. How to accelerate the training and scoring on new-coming samples by outlyingness (referred as prediction throughout the paper) with a large number of unsupervised, heterogeneous OD models? In this study, we propose a modular acceleration system, called SUOD, to address it. The proposed system focuses on three complementary acceleration aspects (data reduction for high-dimensional data, approximation for costly models, and taskload imbalance optimization for distributed environment), while maintaining performance accuracy. Extensive experiments on more than 20 benchmark datasets demonstrate SUOD's effectiveness in heterogeneous OD acceleration, along with a real-world deployment case on fraudulent claim analysis at IQVIA, a leading healthcare firm. We open-source SUOD for reproducibility and accessibility.

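Research Motivation and Objectives

Promote the use of heterogeneous unsupervised outlier detectors to improve robustness over single-model approaches.
Develop an end-to-end acceleration framework that addresses bottlenecks at the data, model, and execution levels.
Substantially reduce training and prediction time on large, high-dimensional datasets while preserving detection accuracy.
Demonstrate effectiveness through extensive benchmarking and a real-world fraud detection deployment.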

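Proposed Method

Apply a Johnson-Lindenstrauss random projection for each base model to create a low-dimensional subspace, which approximately preserves pairwise distances and introduces diversity across models.
Use pseudo-supervised approximation: replace costly unsupervised detectors with fast supervised regressors trained on pseudo ground truth, that is, the detector's own output on the training data.
Use a model cost predictor to forecast execution time, then schedule models across workers in a balanced, parallel fashion to reduce taskload imbalance.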
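The first step in the list above can be illustrated with a minimal sketch. It assumes a plain Gaussian projection matrix, which is only one member of the Johnson-Lindenstrauss family and not necessarily the variant SUOD uses by default. The function names, the toy data, and the dimensions (200 reduced to 64) are all hypothetical choices for illustration.

```python
import math
import random


def gaussian_projection_matrix(d, k, seed=0):
    """Return a d x k matrix with i.i.d. N(0, 1/k) entries."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]


def project(X, W):
    """Multiply each row of X (n x d) by W (d x k)."""
    k = len(W[0])
    return [
        [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(k)]
        for x in X
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Toy data: 20 points in 200 dimensions (made up for illustration).
data_rng = random.Random(42)
X = [[data_rng.uniform(-1.0, 1.0) for _ in range(200)] for _ in range(20)]

# Each base model gets its own seed, so each one sees a different subspace.
subspaces = {
    model_id: project(X, gaussian_projection_matrix(200, 64, seed=model_id))
    for model_id in range(3)
}

# Compare one pairwise distance before and after projection, per model.
original = euclidean(X[0], X[1])
ratios = {m: euclidean(Z[0], Z[1]) / original for m, Z in subspaces.items()}
print(ratios)
```

The printed ratios should sit close to 1, which is the sense in which pairwise distances are approximately preserved. Because every model draws its own matrix, the projected copies differ from one another, and that difference is the source of the diversity mentioned above.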

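Experimental Results

Research Questions

RQ1: Can data-level random projection reduce dimensionality while preserving the outlier-relevant structure needed by heterogeneous OD ensembles?
RQ2: How effectively does pseudo-supervised approximation accelerate prediction without a significant loss of accuracy?
RQ3: Does forecast-driven, balanced scheduling improve the training and prediction efficiency of distributed heterogeneous OD across different numbers of models (m) and workers (t)?
RQ4: What are SUOD's overall performance trade-offs when data reduction, model approximation, and scheduling are combined?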

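Key Findings

Data compression with JL projection methods, especially the circulant and toeplitz variants, yields considerable time savings while matching or exceeding both no projection and PCA on ROC and precision metrics.
Pseudo-supervised approximators speed up prediction for costly OD models with almost no loss of accuracy, and in some cases even improve ROC.
Balanced parallel scheduling guided by the model cost predictor reduces execution time and mitigates load imbalance across workers.
The full SUOD system delivers combined gains in heterogeneous OD acceleration, validated on more than 20 benchmark datasets and in a real-world fraud detection deployment at IQVIA.
The open-source release of SUOD supports reproducibility and integrates with PyOD and scikit-learn style APIs.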
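The balanced scheduling finding above can be made concrete with a small sketch. It assumes that forecast runtimes are already available as a plain dictionary, standing in for the model cost predictor. It also uses a greedy longest-processing-time heuristic as an illustrative balancing rule, which is not necessarily the rule SUOD implements. The model names and costs are hypothetical.

```python
import heapq


def balanced_schedule(predicted_costs, n_workers):
    """Greedy longest-processing-time assignment.

    predicted_costs: dict of model name -> forecast runtime (any unit).
    Returns one (total_cost, [model names]) pair per worker.
    """
    # Min-heap of (current load, worker index); the lightest worker is on top.
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]

    # Place the most expensive models first.
    for name, cost in sorted(predicted_costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(name)
        heapq.heappush(heap, (load + cost, w))

    loads = [0.0] * n_workers
    for load, w in heap:
        loads[w] = load
    return list(zip(loads, assignment))


def naive_schedule(predicted_costs, n_workers):
    """Baseline: split models into contiguous, equally sized chunks."""
    names = list(predicted_costs)
    size = -(-len(names) // n_workers)  # ceiling division
    chunks = [names[i:i + size] for i in range(0, len(names), size)]
    chunks += [[] for _ in range(n_workers - len(chunks))]
    return [(sum(predicted_costs[n] for n in c), c) for c in chunks]


# Hypothetical forecast runtimes in seconds; the numbers are made up.
costs = {"knn_a": 9.0, "knn_b": 8.0, "lof_a": 7.0, "lof_b": 6.0,
         "hbos_a": 1.0, "hbos_b": 1.0, "iforest_a": 2.0, "iforest_b": 2.0}

naive = naive_schedule(costs, 4)
balanced = balanced_schedule(costs, 4)
naive_makespan = max(load for load, _ in naive)
balanced_makespan = max(load for load, _ in balanced)
print(naive_makespan, balanced_makespan)
```

On these made-up numbers the contiguous split leaves one worker with 17 seconds of forecast work, while the greedy assignment gives every worker 9 seconds. A naive split is slow for exactly this reason: similar, costly models tend to sit next to each other in the model list.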

This review was generated by AI and reviewed by human editors.