QUICK REVIEW

[Paper Review] Statistical Analysis of Nearest Neighbor Methods for Anomaly Detection

Xiaoyi Gu, Leman Akoglu|arXiv (Cornell University)|Jul 8, 2019

Anomaly Detection Techniques and Applications | References: 26 | Citations: 50

One-line summary

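This paper empirically compares NN-based anomaly detectors, provides a theoretical framework built on the distance-to-a-measure (DTM), and establishes finite-sample guarantees under Huber's contamination model.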

ABSTRACT

Nearest-neighbor (NN) procedures are well studied and widely used in both supervised and unsupervised learning problems. In this paper we are concerned with investigating the performance of NN-based methods for anomaly detection. We first show through extensive simulations that NN methods compare favorably to some of the other state-of-the-art algorithms for anomaly detection based on a set of benchmark synthetic datasets. We further consider the performance of NN methods on real datasets, and relate it to the dimensionality of the problem. Next, we analyze the theoretical properties of NN-methods for anomaly detection by studying a more general quantity called distance-to-measure (DTM), originally developed in the literature on robust geometric and topological inference. We provide finite-sample uniform guarantees for the empirical DTM and use them to derive misclassification rates for anomalous observations under various settings. In our analysis we rely on Huber's contamination model and formulate mild geometric regularity assumptions on the underlying distribution of the data.

Motivation and objectives

Evaluate NN-based anomaly detection in unsupervised settings on synthetic and real datasets.
Compare NN methods to state-of-the-art detectors such as Isolation Forest, LOF, and LODA.
Develop a statistical framework based on distance-to-a-measure (DTM) to understand NN methods theoretically.
Provide finite-sample bounds for empirical NN radii and DTM convergence.
Characterize conditions under which DTM-based methods can reliably separate normal and anomalous points.

Proposed method

Analyze two NN anomaly detectors: k-NN (average distance to k neighbors) and k-th NN (distance to k-th neighbor).
Introduce and utilize the distance-to-a-measure (DTM) functional as a generalization of NN methods (DTM_q with q≥1; DTM_2 corresponds to the standard DTM with q=2).
Establish population and empirical definitions of the p-NN radii, r_p(x) and \hat{r}_p(x), based on the distribution P and the empirical measure P_n respectively.
Prove finite-sample uniform bounds for \hat{r}_p(x) and for the empirical DTM \hat{d}(x) under assumptions A0-A2.
Derive misclassification/separation guarantees for DTM-based anomaly detection under the Huber contamination model.
Provide supplemental proofs and discuss high-dimensional behavior and safety zones where normal points are correctly classified.
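
The three scores described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names (knn_distances, knn_score, kth_nn_score, dtm_score) are hypothetical, the brute-force distance computation is chosen for clarity rather than speed, and the DTM_q formula (the q-th power mean of the k nearest-neighbor distances) is the usual empirical form, assumed here to match the paper's definition.

```python
import numpy as np

def knn_distances(X_train, X_query, k):
    """Sorted distances from each query point to its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Sort each row in ascending order and keep the k smallest distances.
    return np.sort(dists, axis=1)[:, :k]

def knn_score(X_train, X_query, k):
    """k-NN score: average distance to the k nearest neighbors."""
    return knn_distances(X_train, X_query, k).mean(axis=1)

def kth_nn_score(X_train, X_query, k):
    """k-th NN score: distance to the k-th nearest neighbor."""
    return knn_distances(X_train, X_query, k)[:, -1]

def dtm_score(X_train, X_query, k, q=2.0):
    """Empirical DTM_q (assumed form): q-th power mean of the k nearest distances."""
    d = knn_distances(X_train, X_query, k)
    return (d ** q).mean(axis=1) ** (1.0 / q)

# Toy data: the query point is at distance 3, 4, and 10 from the training points.
X_train = np.array([[3.0, 0.0], [0.0, 4.0], [6.0, 8.0]])
X_query = np.array([[0.0, 0.0]])
print(knn_score(X_train, X_query, k=2))     # average of distances 3 and 4
print(kth_nn_score(X_train, X_query, k=2))  # second-smallest distance
print(dtm_score(X_train, X_query, k=2))     # root mean square of distances 3 and 4
```

In this sketch, q=1 reproduces the k-NN average, and larger q puts more weight on the farthest of the k neighbors, which is one way to see DTM_q as a family that covers both NN scores. If the query points are themselves part of X_train, each point counts itself as a neighbor at distance zero.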

Experimental results

Research questions

RQ1: Are NN-based anomaly detectors (k-NN, k-th NN, and DTM_2) competitive with established methods (Isolation Forest, LOF, LODA, etc.) across synthetic and real datasets?
RQ2: What theoretical guarantees can be provided for the empirical DTM to separate normal from anomalous observations under Huber contamination?
RQ3: Under mild regularity conditions, how do the empirical NN radii and the empirical DTM converge to their population counterparts as the sample size grows?
RQ4: How does data dimensionality affect the performance and reliability of NN-based anomaly detection methods?
RQ5: What finite-sample bounds ensure correct classification in regions deep inside the normal support (the safety zone) as opposed to regions near its boundary?
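
For reference, the population and empirical quantities that RQ2, RQ3, and RQ5 refer to can be written as below. These are the standard definitions from the DTM literature, given here as a sketch: the symbols m (mass level), \varepsilon (contamination fraction), P_0 and P_1 (normal and anomalous distributions), and B(x, r) (ball of radius r around x) are notational assumptions, and the paper's exact conventions (for example open versus closed balls) may differ.

```latex
% p-NN radius under P, and its empirical version under the empirical measure P_n
r_p(x) = \inf\{ r > 0 : P(B(x, r)) \ge p \}, \qquad
\hat{r}_p(x) = \inf\{ r > 0 : P_n(B(x, r)) \ge p \}

% Distance-to-a-measure at mass level m \in (0, 1)
d_{P,m}(x) = \left( \frac{1}{m} \int_0^m r_p(x)^2 \, dp \right)^{1/2}

% Empirical DTM when m = k/n, with X_{(i)}(x) the i-th nearest sample point to x
\hat{d}(x) = \left( \frac{1}{k} \sum_{i=1}^{k} \| x - X_{(i)}(x) \|^2 \right)^{1/2}

% Huber contamination model: a fraction \varepsilon of the data comes from P_1
P = (1 - \varepsilon) P_0 + \varepsilon P_1
```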

Key findings

Algorithm	AUC	AP	Either
ABOD	0.5898	0.6784	0.7000
IForest	0.5520	0.6514	0.6741
LODA	0.6187	0.6955	0.7194
LOF	0.6016	0.7071	0.7331
RKDE	0.6122	0.7030	0.7194
OCSVM	0.7218	0.7342	0.7969
SVDD	0.8482	0.8868	0.9080
EGMM	0.6188	0.7146	0.7303
kNN	0.5646	0.6744	0.6960
k-th NN	0.5831	0.6886	0.7100
DTM_2	0.5669	0.6761	0.6977
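
The AUC and AP columns refer to two standard ranking criteria. As a rough sketch of how each is computed from a vector of anomaly scores and ground-truth labels (how the paper aggregates them into the table entries is not shown here), the following uses hypothetical helper names roc_auc and average_precision and made-up toy scores.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC: probability that a random anomaly outscores a random normal point.

    Ties between an anomaly and a normal point count as one half.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def average_precision(labels, scores):
    """AP: mean of the precision measured at the rank of each true anomaly."""
    labels = np.asarray(labels, dtype=bool)
    # Rank points from highest to lowest score (stable, so ties keep input order).
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = labels[order]
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision_at_rank[hits].mean()

# Toy example: two anomalies (label 1) among five points.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(roc_auc(labels, scores))            # 5 of 6 anomaly/normal pairs are ordered correctly
print(average_precision(labels, scores))  # mean of precision 1/1 and 2/3
```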

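NN-based detectors (k-NN, k-th NN, and DTM_2) perform competitively with state-of-the-art methods on the benchmark synthetic datasets and on real ODDS/UCI datasets.
Isolation Forest tends to achieve the lowest failure rates (AUC, AP) on many datasets, with NN methods close behind. On high-dimensional data, NN methods and LOF/DTM variants perform better in specific cases.
In the high-dimensional experiments, NN methods outperform Isolation Forest in some settings, while LOF and DTM-based variants can be advantageous in others.
The paper derives uniform finite-sample bounds for the empirical NN radii and the empirical DTM, which guarantee convergence under suitable conditions (A1, including the dependence on the dimension d).
As a deterministic separation result, the paper shows that there is a safety zone A_eta deep inside the normal support on which the population DTM is separated from its values on the anomalous support. With enough samples and a suitable separation eta, points in A_eta are classified correctly with high probability.
From the bounds on the estimated DTM, a corollary turns this separation into a practical classification guarantee: with enough samples and sufficient regularity, every point in A_eta is classified correctly with high probability.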
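
The separation idea can be illustrated with a small simulation under a Huber-type mixture. Everything specific below is an assumption made for illustration: P_0 is uniform on the unit square, P_1 is uniform on a far-away square, and n, eps, k, and the helper names sample_huber and empirical_dtm are hypothetical. The threshold is picked using the true labels only to display the gap between the two groups, which a real unsupervised detector could not do.

```python
import numpy as np

def sample_huber(n, eps, rng):
    """Draw n points from (1 - eps) * P0 + eps * P1 (illustrative P0 and P1)."""
    is_anomaly = rng.random(n) < eps
    normal = rng.uniform(0.0, 1.0, size=(n, 2))        # P0: uniform on [0, 1]^2
    anomalous = rng.uniform(50.0, 51.0, size=(n, 2))   # P1: uniform on [50, 51]^2
    X = np.where(is_anomaly[:, None], anomalous, normal)
    return X, is_anomaly

def empirical_dtm(X, k):
    """Empirical DTM_2 at each sample point over its k nearest sample points.

    Each point counts itself as its own nearest neighbor (distance zero).
    """
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    nearest_sq = np.sort(sq_dists, axis=1)[:, :k]
    return np.sqrt(nearest_sq.mean(axis=1))

rng = np.random.default_rng(0)
X, is_anomaly = sample_huber(n=500, eps=0.05, rng=rng)

# k is chosen larger than the expected number of anomalies (about 25 here),
# so every anomaly must reach into the normal cluster to find k neighbors.
scores = empirical_dtm(X, k=100)

# Oracle threshold halfway between the two groups (uses labels, illustration only).
threshold = 0.5 * (scores[~is_anomaly].max() + scores[is_anomaly].min())
predicted = scores > threshold
print("number of anomalies:", int(is_anomaly.sum()))
print("all points classified correctly:", bool((predicted == is_anomaly).all()))
```

The gap appears because a normal point finds all k neighbors inside the unit square, so its score is at most sqrt(2), while an anomaly has fewer than k anomalous neighbors and must include normal points roughly 69 units away. With overlapping supports or a smaller k the gap shrinks, which is the regime the paper's safety-zone conditions address.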

This review was generated by AI and checked by a human editor.