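[Paper Summary] Mass Volume Curves and Anomaly Ranking
This paper proposes the Mass Volume (MV) curve as a functional performance criterion for unsupervised anomaly ranking and casts anomaly scoring as an M-estimation problem. It introduces a data-driven method that builds a piecewise constant scoring function by adaptively estimating minimum volume sets, establishes generalization bounds in sup norm between the MV curve of the learned scoring function and the optimal MV curve, and gives theoretical guarantees for confidence regions built with a smoothed bootstrap.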
This paper aims at formulating the issue of ranking multivariate unlabeled observations depending on their degree of abnormality as an unsupervised statistical learning task. In the 1-d situation, this problem is usually tackled by means of tail estimation techniques: univariate observations are viewed as all the more 'abnormal' as they are located far in the tail(s) of the underlying probability distribution. It would be desirable as well to have a scalar valued 'scoring' function allowing for comparing the degree of abnormality of multivariate observations. Here we formulate the issue of scoring anomalies as an M-estimation problem by means of a novel functional performance criterion, referred to as the Mass Volume curve (MV curve in short), whose optimal elements are strictly increasing transforms of the density almost everywhere on the support of the density. We first study the statistical estimation of the MV curve of a given scoring function and we provide a strategy to build confidence regions using a smoothed bootstrap approach. Optimization of this functional criterion over the set of piecewise constant scoring functions is next tackled. This boils down to estimating a sequence of empirical minimum volume sets whose levels are chosen adaptively from the data, so as to adjust to the variations of the optimal MV curve, while controlling the bias of its approximation by a stepwise curve. Generalization bounds are then established for the difference in sup norm between the MV curve of the empirical scoring function thus obtained and the optimal MV curve.
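Research Motivation and Goals
- Formulate multivariate anomaly ranking as an unsupervised M-estimation problem based on a novel functional criterion.
- Define a performance measure, the Mass Volume (MV) curve, that makes it possible to compare scoring functions for anomaly detection.
- Develop a statistical learning procedure that builds a near-optimal scoring function from unlabeled data.
- Establish generalization bounds between the MV curve of the learned scoring function and the optimal MV curve.
- Provide a computationally feasible smoothed bootstrap procedure for building confidence regions for the MV curve of a given scoring function.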
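For reference, the criterion named in these goals can be written out as follows. This is a sketch of the standard definition under assumed notation: the symbols $\alpha_s$, $\lambda_s$, $\mathrm{MV}_s$ and $\mathrm{MV}^*$ do not appear in the text above and are introduced here only for illustration, with $X$ the random observation, $f$ its density, and $\mathrm{Leb}$ the Lebesgue measure.

```latex
% Mass and volume of the level sets of a scoring function s
\alpha_s(t) = \mathbb{P}\{ s(X) \ge t \}, \qquad
\lambda_s(t) = \mathrm{Leb}\bigl( \{ x : s(x) \ge t \} \bigr)

% MV curve of s: volume of the level set that carries mass alpha
\mathrm{MV}_s(\alpha) = \lambda_s\bigl( \alpha_s^{-1}(\alpha) \bigr),
\qquad \alpha \in (0, 1)

% Optimal curve: smallest volume among measurable sets of mass at least alpha
\mathrm{MV}^*(\alpha) = \min_{\Omega \ \text{measurable}}
  \bigl\{ \mathrm{Leb}(\Omega) : \mathbb{P}\{ X \in \Omega \} \ge \alpha \bigr\}
  \le \mathrm{MV}_s(\alpha)
```

Here $\alpha_s^{-1}$ denotes a generalized inverse. Under suitable regularity assumptions on $f$, the minimum is attained by a density level set $\{ x : f(x) \ge t_\alpha \}$, which is why any strictly increasing transform of $f$ has the optimal MV curve.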
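Proposed Method
- Introduces the Mass Volume (MV) curve as a functional criterion for evaluating anomaly scoring functions; the optimal curve is attained by strictly increasing transforms of the underlying density.
- Uses a smoothed bootstrap to build confidence regions for the MV curve of a given scoring function, with a consistency result and an analysis of the rate of convergence.
- Designs a data-driven algorithm that adaptively selects the mass levels at which minimum volume sets are estimated, so as to follow the shape of the optimal MV curve.
- Builds a piecewise constant scoring function from the estimated minimum volume sets, so that its MV curve approximates the optimal curve.
- Establishes generalization bounds in sup norm between the MV curve of the learned scoring function and the optimal MV curve, which quantify the learning accuracy.
- Uses kernel density estimation with bandwidth selection to estimate nonparametrically the density of the score and its derivative, in support of the MV curve construction.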
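The estimation of the MV curve of a given scoring function can be illustrated with a short sketch. This is not the paper's estimator: the function name `empirical_mv_curve`, the Monte Carlo volume estimate over a bounding box, and the parameters `n_mc` and `margin` are assumptions made here for illustration. Mass is measured on the sample through an empirical quantile of the scores, and volume through uniform points drawn in a box around the data.

```python
import numpy as np


def empirical_mv_curve(score, X, alphas, n_mc=20000, margin=0.5, seed=0):
    """Estimate MV_s(alpha) for each alpha in `alphas` (hypothetical helper).

    score  : callable mapping an (m, d) array to m scores (high = normal)
    X      : (n, d) array of unlabeled observations
    alphas : increasing mass levels in (0, 1)
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    alphas = np.asarray(alphas, dtype=float)

    # Bounding box around the data; level sets are truncated to this box,
    # so volumes of unbounded level sets are underestimated (assumption).
    lo = X.min(axis=0) - margin
    hi = X.max(axis=0) + margin
    box_volume = float(np.prod(hi - lo))
    U = rng.uniform(lo, hi, size=(n_mc, X.shape[1]))

    s_data = score(X)
    s_unif = score(U)

    # Threshold t_alpha: a fraction alpha of the sample scores at least t_alpha.
    thresholds = np.quantile(s_data, 1.0 - alphas)

    # Volume of {s >= t_alpha}: box volume times the fraction of uniform hits.
    return np.array([box_volume * np.mean(s_unif >= t) for t in thresholds])


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((2000, 2))
    alphas = np.linspace(0.1, 0.9, 9)
    # -||x||^2 is an increasing transform of the standard Gaussian density.
    mv = empirical_mv_curve(lambda Z: -np.sum(Z**2, axis=1), X, alphas)
    print(np.round(mv, 2))
```

For a standard Gaussian in dimension 2 the optimal curve has the closed form $-2\pi \ln(1 - \alpha)$, which gives a convenient sanity check for the sketch. A score that ignores one coordinate, such as `-abs(x[:, 0])`, should produce a curve lying above it.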
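Experimental Results
Research Questions
- RQ1: How can anomaly ranking be formulated as a functional M-estimation problem in a high-dimensional multivariate setting?
- RQ2: What is the optimal scoring function in the sense of the MV curve, and how does it relate to the underlying data density?
- RQ3: Can confidence regions for the MV curve of a scoring function be built in a computationally feasible way?
- RQ4: How can adaptive minimum volume set estimation be used to learn a near-optimal scoring function from unlabeled data?
- RQ5: What generalization bounds can be established for the difference between the MV curve of the learned scoring function and the optimal MV curve?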
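Key Findings
- The optimal scoring functions are, almost everywhere on the support, strictly increasing transforms of the underlying probability density.
- The MV curve of the empirical scoring function built by adaptive minimum volume set estimation converges in sup norm to the optimal MV curve, at a rate governed by the sample size and the bandwidth.
- The smoothed bootstrap for MV curve confidence regions is consistent and improves on the naive bootstrap; its rate of convergence supports its use in practice.
- The generalization error of the algorithm is bounded in sup norm, with a bound that depends on the VC properties of the kernel and on the smoothness of the density.
- The derivative of the optimal MV curve is shown to admit a simpler formula than previously known, one that relates directly to the derivative of the density.
- The method adapts the choice of mass levels for the minimum volume sets, which yields a better approximation of the shape of the unknown optimal MV curve.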
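The difference between the naive and the smoothed bootstrap mentioned above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's procedure: the helper names `smoothed_bootstrap_sample` and `sup_norm_band`, the Gaussian kernel, and the choice to center the band on the plug-in estimate are all introduced here. A naive bootstrap resamples the data with replacement; the smoothed version additionally perturbs each resampled point with kernel noise, which amounts to drawing from a kernel density estimate.

```python
import numpy as np


def smoothed_bootstrap_sample(X, bandwidth, rng):
    """Resample X with replacement, then add Gaussian kernel noise.

    Equivalent to drawing len(X) points from a Gaussian kernel density
    estimate of X with the given bandwidth (hypothetical helper).
    """
    X = np.asarray(X, dtype=float)
    idx = rng.integers(0, X.shape[0], size=X.shape[0])
    return X[idx] + bandwidth * rng.standard_normal(X.shape)


def sup_norm_band(X, curve_fn, bandwidth, n_boot=200, level=0.95, seed=0):
    """Uniform band around curve_fn(X) from smoothed bootstrap replicates.

    curve_fn maps a sample to a curve evaluated on a fixed grid.
    Centering on the plug-in estimate curve_fn(X) is a simplification.
    """
    rng = np.random.default_rng(seed)
    center = np.asarray(curve_fn(X), dtype=float)
    deviations = np.empty(n_boot)
    for b in range(n_boot):
        X_star = smoothed_bootstrap_sample(X, bandwidth, rng)
        # Sup-norm distance between the replicate curve and the estimate.
        deviations[b] = np.max(np.abs(curve_fn(X_star) - center))
    half_width = np.quantile(deviations, level)
    return center - half_width, center + half_width


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    scores = rng.standard_normal(500)          # stand-in for s(X_1), ..., s(X_n)
    grid = np.linspace(0.1, 0.9, 9)
    # Rule-of-thumb bandwidth (assumption): 1.06 * std * n^(-1/5).
    h = 1.06 * scores.std() * len(scores) ** (-0.2)
    lower, upper = sup_norm_band(scores, lambda s: np.quantile(s, grid), h)
    print(np.round(lower, 2), np.round(upper, 2))
```

The demo uses a quantile curve of one-dimensional scores as a stand-in for the MV curve; any function that maps a sample to a curve on a fixed grid can be passed as `curve_fn`.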
This summary was generated by AI and reviewed by a human editor.