QUICK REVIEW

[Paper Review] Mass Volume Curves and Anomaly Ranking

Stéphan Clémençon, Albert Thomas | arXiv (Cornell University) | May 3, 2017

Advanced Statistical Methods and Models | References: 31 | Citations: 12

One-line summary

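This paper introduces the Mass Volume (MV) curve as a functional performance criterion for unsupervised anomaly ranking and formulates anomaly scoring as an M-estimation problem. It proposes a data-driven, piecewise constant scoring function built from adaptively estimated minimum volume sets, and establishes generalization bounds in sup norm on the difference between the MV curve of the learned scoring function and the optimal MV curve. It also gives a smoothed bootstrap strategy, with theoretical guarantees, for building confidence regions for the MV curve of a given scoring function.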
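For reference, the MV curve mentioned above can be written as follows. The notation is assumed here for illustration: s is a scoring function (higher score means more normal), X is a random observation, and \lambda is Lebesgue measure. The distribution of s(X) is assumed continuous so that the inverse is well defined.

```latex
\alpha_s(t) = \mathbb{P}\{ s(X) \ge t \}, \qquad
\lambda_s(t) = \lambda\bigl( \{ x : s(x) \ge t \} \bigr), \qquad
\mathrm{MV}_s(\alpha) = \lambda_s\bigl( \alpha_s^{-1}(\alpha) \bigr), \quad \alpha \in (0, 1).
```

Read this way, MV_s(\alpha) is the volume of the level set of s that captures probability mass \alpha. Lower curves are better, and the optimal curve at level \alpha is the volume of a minimum volume set of mass \alpha.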

ABSTRACT

This paper aims at formulating the issue of ranking multivariate unlabeled observations depending on their degree of abnormality as an unsupervised statistical learning task. In the 1-d situation, this problem is usually tackled by means of tail estimation techniques: univariate observations are viewed as all the more 'abnormal' as they are located far in the tail(s) of the underlying probability distribution. It would be desirable as well to dispose of a scalar valued 'scoring' function allowing for comparing the degree of abnormality of multivariate observations. Here we formulate the issue of scoring anomalies as an M-estimation problem by means of a novel functional performance criterion, referred to as the Mass Volume curve (MV curve in short), whose optimal elements are strictly increasing transforms of the density almost everywhere on the support of the density. We first study the statistical estimation of the MV curve of a given scoring function and we provide a strategy to build confidence regions using a smoothed bootstrap approach. Optimization of this functional criterion over the set of piecewise constant scoring functions is next tackled. This boils down to estimating a sequence of empirical minimum volume sets whose levels are chosen adaptively from the data, so as to adjust to the variations of the optimal MV curve, while controlling the bias of its approximation by a stepwise curve. Generalization bounds are then established for the difference in sup norm between the MV curve of the empirical scoring function thus obtained and the optimal MV curve.

Motivation and objectives

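Formulate multivariate anomaly ranking as an unsupervised M-estimation problem using a new functional criterion.
Define the Mass Volume (MV) curve, a performance measure that makes it possible to compare scoring functions for anomaly detection.
Develop a statistical learning procedure that builds a nearly optimal scoring function from unlabeled data.
Establish generalization bounds for the MV curve of the learned scoring function.
Provide a computationally feasible smoothed bootstrap method for building confidence regions for the MV curve of a given scoring function.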
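To make the comparison of scoring functions concrete, here is a minimal sketch of an empirical MV curve. It is not the paper's estimator: the function name `empirical_mv_curve`, the Monte Carlo volume approximation over the bounding box of the sample, and the two toy scoring functions are all assumptions made for illustration.

```python
import numpy as np

def empirical_mv_curve(score, X, alphas, n_mc=20_000, seed=0):
    """Approximate MV_s(alpha) for a scoring function (higher score = more normal).

    Assumption: the volume of a level set is approximated by Monte Carlo
    sampling inside the bounding box of X, so mass outside the box is ignored.
    """
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    box_volume = float(np.prod(hi - lo))
    U = rng.uniform(lo, hi, size=(n_mc, X.shape[1]))  # uniform points in the box
    s_data, s_unif = score(X), score(U)
    # Threshold t such that roughly a fraction alpha of the data has score >= t.
    thresholds = np.quantile(s_data, 1.0 - np.asarray(alphas))
    # Volume of {s >= t}: box volume times the fraction of uniform points inside.
    return np.array([box_volume * np.mean(s_unif >= t) for t in thresholds])

# Toy comparison on 2-d standard Gaussian data (hypothetical example).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
alphas = np.linspace(0.1, 0.9, 9)
# An increasing transform of the Gaussian density: should give a low curve.
good = empirical_mv_curve(lambda Z: -np.linalg.norm(Z, axis=1), X, alphas)
# A score that only looks at the first coordinate: should give a higher curve.
poor = empirical_mv_curve(lambda Z: -np.abs(Z[:, 0]), X, alphas)
```

Plotting `good` and `poor` against `alphas` shows the intended reading of the criterion: the scoring function whose curve lies lower captures the same mass with less volume.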

Proposed method

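Propose the Mass Volume (MV) curve as a functional criterion for evaluating anomaly scoring functions; its optimal elements are the strictly increasing transforms of the underlying density.
Estimate confidence regions for the MV curve of a given scoring function with a smoothed bootstrap, supported by consistency results and a rate-of-convergence analysis.
Design an algorithm that adaptively selects the mass levels used for minimum volume set estimation so that they follow the shape of the optimal MV curve.
Build a piecewise constant scoring function from the estimated minimum volume sets so that its MV curve approximates the optimal curve.
Establish generalization bounds in sup norm between the MV curve of the learned scoring function and the optimal MV curve, quantifying the accuracy of learning.
Use kernel density estimation with bandwidth selection to estimate the density of the score and its derivative nonparametrically, in support of MV curve estimation.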
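The piecewise constant construction can be sketched as a sum of indicators of nested sets. The sketch below swaps in a simpler technique than the paper's: the nested sets are plug-in level sets of a fixed-bandwidth Gaussian kernel density estimate, and the mass levels are a fixed grid rather than being chosen adaptively. The names `gaussian_kde_values` and `fit_stepwise_score` are hypothetical.

```python
import numpy as np

def gaussian_kde_values(points, data, bandwidth):
    """Evaluate a fixed-bandwidth Gaussian kernel density estimate at `points`."""
    d = data.shape[1]
    sq_dist = ((points[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * bandwidth**2) ** (d / 2.0)
    return np.exp(-sq_dist / (2.0 * bandwidth**2)).mean(axis=1) / norm

def fit_stepwise_score(data, mass_levels, bandwidth=0.5):
    """Return s(x) = number of nested estimated sets that contain x.

    Assumption: each set is a plug-in density level set {f_hat >= t_k}, with
    t_k chosen so that about a fraction mass_levels[k] of the data falls inside.
    """
    dens_on_data = gaussian_kde_values(data, data, bandwidth)
    thresholds = np.quantile(dens_on_data, 1.0 - np.asarray(mass_levels))

    def score(points):
        dens = gaussian_kde_values(np.atleast_2d(points), data, bandwidth)
        # Count how many level sets contain each point (a stepwise score).
        return (dens[:, None] >= thresholds[None, :]).sum(axis=1)

    return score

# Toy usage on 2-d Gaussian data (hypothetical example).
rng = np.random.default_rng(0)
data = rng.normal(size=(400, 2))
mass_levels = [0.25, 0.5, 0.75, 0.95]
score = fit_stepwise_score(data, mass_levels)
s = score(np.array([[0.0, 0.0], [10.0, 10.0]]))  # a central point and a far outlier
```

The score takes integer values between 0 and the number of mass levels, so observations are ranked by how many of the nested sets they fall into; points outside every set receive the lowest score and are treated as the most abnormal.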

Results

Research questions

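RQ1: How can anomaly ranking in multivariate, high-dimensional settings be formulated as a functional M-estimation problem?
RQ2: Which scoring functions are optimal with respect to the MV curve, and how do they relate to the underlying data density?
RQ3: Can confidence regions for the MV curve of a scoring function be built in a computationally feasible way?
RQ4: How can a nearly optimal scoring function be learned from unlabeled data using adaptive minimum volume set estimation?
RQ5: What generalization bounds can be established for the difference between the MV curve of the learned scoring function and the optimal MV curve?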

Key findings

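Optimal scoring functions are strictly increasing transforms of the underlying probability density, almost everywhere on the support of that density.
The MV curve of the empirical scoring function built from adaptive minimum volume set estimation converges to the optimal MV curve in sup norm, at a rate that depends on the sample size and the bandwidth.
The smoothed bootstrap for MV curve confidence regions is consistent, and its rate of convergence improves on that of the naive bootstrap, which supports practical use.
The generalization error of the algorithm is bounded in sup norm, with a bound that depends on the VC properties of the kernel and the smoothness of the density.
The derivative of the optimal MV curve is given by a simpler expression than previously available ones and is shown to relate directly to the derivative of the density.
The proposed method selects the mass levels of the minimum volume sets adaptively, which gives a better approximation of the unknown shape of the optimal MV curve.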
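The smoothed bootstrap idea can be illustrated on the one-dimensional scores. The sketch below is a generic version, not the paper's exact procedure: it resamples the scores, adds Gaussian kernel noise (which amounts to drawing from a kernel density estimate), and reports pointwise percentile bands rather than a uniform confidence region. The names `smoothed_bootstrap_mv_band` and `volume_fn`, and the rule-of-thumb bandwidth, are assumptions.

```python
import numpy as np

def smoothed_bootstrap_mv_band(scores, alphas, volume_fn, bandwidth,
                               n_boot=500, level=0.95, seed=0):
    """Pointwise percentile band for MV_s(alpha) from a smoothed bootstrap.

    Assumption: `volume_fn(t)` returns the known volume of {x : s(x) >= t},
    so only the score quantiles carry sampling uncertainty.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    q = 1.0 - np.asarray(alphas)
    boot = np.empty((n_boot, q.size))
    for b in range(n_boot):
        resample = rng.choice(scores, size=n, replace=True)
        # Adding kernel noise turns the naive bootstrap into a smoothed one.
        smoothed = resample + bandwidth * rng.standard_normal(n)
        boot[b] = np.quantile(smoothed, q)
    tail = (1.0 - level) / 2.0
    lower_t, upper_t = np.quantile(boot, [tail, 1.0 - tail], axis=0)
    # The volume is nonincreasing in the threshold, so the bounds swap.
    return volume_fn(upper_t), volume_fn(lower_t)

# Toy usage: 1-d Gaussian data with score s(x) = -|x|, whose level set
# {s >= t} is the interval [t, -t] of length -2t for t <= 0.
rng = np.random.default_rng(2)
x = rng.normal(size=1000)
scores = -np.abs(x)
h = 1.06 * scores.std() * scores.size ** (-1 / 5)  # rule-of-thumb bandwidth
alphas = np.array([0.5, 0.9])
mv_lo, mv_hi = smoothed_bootstrap_mv_band(
    scores, alphas, lambda t: 2.0 * np.maximum(-t, 0.0), h)
```

Setting `bandwidth=0` recovers the naive bootstrap, which makes it easy to compare the two bands on the same sample.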


This review was generated by AI and checked by a human editor.