QUICK REVIEW

[Paper Review] Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining

Sami Hanhijärvi, Markus Ojala|arXiv (Cornell University)|Jun 16, 2020

Data Mining Algorithms and Applications | References: 16 | Cited by: 72

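One-Sentence Summary

The paper introduces probabilistic data randomization methods that preserve previously discovered patterns, which makes significance testing possible in iterative data mining. It shows that keeping earlier results in the null model can change the inferred significance of new patterns and structures.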

ABSTRACT

There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.

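Research Motivation and Objectives

Motivate the need to assess whether the results of one data mining method add information beyond what earlier analyses already revealed.
Develop randomization-based null models that preserve previously discovered patterns or models.
Enable significance testing in iterative data mining by comparing the original results against randomized datasets that respect earlier findings.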

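Proposed Method

Define a structural measure that summarizes the result of a data mining task.
Use Metropolis sampling with local swaps to generate random datasets that preserve specified statistics.
Formulate exact (ExactRand) and soft (SoftRand) randomization problems for margins, clusterings, and itemset frequencies.
Compute empirical p-values by comparing the structural measure on the original data with its distribution over the randomized datasets.
Address complexity by showing that exact preservation of itemset frequencies together with margins is hard in general, and propose SoftRand as a practical alternative.
Describe swap-based MCMC algorithms for preserving margins, cluster structure, and itemset frequencies (SoftRand).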
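
The simplest case in the list above, preserving row and column margins of a 0/1 matrix with local swaps and then computing an empirical p-value, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `swap_randomize`, `pair_support`, and `empirical_p_value`, the choice of pair co-occurrence as the structural measure, and the "+1" form of the p-value estimate are all assumptions made for the example.

```python
import numpy as np


def swap_randomize(data, n_attempts, rng):
    """Return a randomized copy of a 0/1 matrix with unchanged row and column sums.

    Hypothetical helper: each attempt picks two rows and two columns at random
    and swaps only if the four cells form a 1 0 / 0 1 "checkerboard".
    """
    m = data.copy()
    n_rows, n_cols = m.shape
    for _ in range(n_attempts):
        r1, r2 = rng.integers(0, n_rows, size=2)
        c1, c2 = rng.integers(0, n_cols, size=2)
        # Turning 1 0 / 0 1 into 0 1 / 1 0 leaves every row and column sum intact.
        if m[r1, c1] == 1 and m[r2, c2] == 1 and m[r1, c2] == 0 and m[r2, c1] == 0:
            m[r1, c1] = m[r2, c2] = 0
            m[r1, c2] = m[r2, c1] = 1
    return m


def pair_support(m, a, b):
    """Example structural measure (an assumption): rows where columns a and b are both 1."""
    return int(np.sum(m[:, a] * m[:, b]))


def empirical_p_value(original_stat, random_stats):
    """One-sided empirical p-value: share of samples at least as large as the original.

    The +1 in numerator and denominator counts the original dataset itself,
    so the estimate is never exactly zero.
    """
    random_stats = np.asarray(random_stats)
    return (np.sum(random_stats >= original_stat) + 1) / (len(random_stats) + 1)
```

To use it, one would compute `pair_support` on the original matrix, call `swap_randomize` many times to get the same measure on each random sample, and pass both to `empirical_p_value`. The number of swap attempts needed for the chain to mix is data dependent and is left to the caller here.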

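Experimental Results

Research Questions

RQ1: How can we determine whether discovered patterns or clusters provide information beyond previously observed structure?
RQ2: Can we generate random datasets that preserve known statistics (margins, cluster centers, itemset frequencies) in order to test significance in iterative mining?
RQ3: How does preserving earlier results in the null model affect the significance of newly discovered patterns or clusters?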

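Key Findings

Randomization that preserves earlier analyses can change empirical p-values; once previously found frequencies are taken into account, larger patterns sometimes become non-significant.
Clustering results can be significant when tested against margins alone, yet lose significance when itemset frequencies are preserved as well.
The study shows that preserving itemset frequencies together with margins is computationally hard in general, which motivates the SoftRand approach.
Metropolis-based SoftRand offers a practical way to preserve itemset frequencies approximately while keeping computation tractable.
In experiments on real data, preserving earlier patterns often reveals dependencies between clusterings and itemset patterns, which affects the significance conclusions.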
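
The idea of approximate preservation through Metropolis sampling can be illustrated with a small sketch. It keeps margins exactly through checkerboard swaps and accepts or rejects each swap according to how far the chosen itemset frequencies drift from their original values. This should be read as an assumption-laden illustration rather than the paper's SoftRand algorithm: the energy function (sum of absolute frequency differences), the `weight` parameter, and the names `itemset_frequency` and `soft_swap_randomize` are all hypothetical.

```python
import math

import numpy as np


def itemset_frequency(m, itemset):
    """Number of rows in which every column listed in `itemset` equals 1."""
    return int(np.all(m[:, list(itemset)] == 1, axis=1).sum())


def soft_swap_randomize(data, itemsets, n_attempts, weight, rng):
    """Metropolis sampler over margin-preserving swaps (illustrative sketch).

    Energy (an assumption): total absolute deviation of the itemset
    frequencies from those of the original data. A larger `weight` keeps
    the frequencies closer to the original values.
    """
    m = data.copy()
    targets = [itemset_frequency(data, s) for s in itemsets]

    def energy(mat):
        return sum(abs(itemset_frequency(mat, s) - t) for s, t in zip(itemsets, targets))

    current = energy(m)  # zero at the start, since m is a copy of data
    n_rows, n_cols = m.shape
    for _ in range(n_attempts):
        r1, r2 = rng.integers(0, n_rows, size=2)
        c1, c2 = rng.integers(0, n_cols, size=2)
        if not (m[r1, c1] == 1 and m[r2, c2] == 1 and m[r1, c2] == 0 and m[r2, c1] == 0):
            continue  # not a valid checkerboard; the chain stays where it is
        # Tentatively apply the swap.
        m[r1, c1] = m[r2, c2] = 0
        m[r1, c2] = m[r2, c1] = 1
        proposed = energy(m)
        # Metropolis rule: always accept improvements, otherwise accept
        # with probability exp(-weight * increase in energy).
        if proposed <= current or rng.random() < math.exp(-weight * (proposed - current)):
            current = proposed
        else:
            # Reject: undo the swap.
            m[r1, c1] = m[r2, c2] = 1
            m[r1, c2] = m[r2, c1] = 0
    return m
```

With `weight` set to 0 every valid swap is accepted and the sampler reduces to plain margin-preserving randomization; with a very large `weight` only swaps that leave the itemset frequencies unchanged get through. Recomputing the energy from scratch at each step is done here for clarity only and would be too slow for large data.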


This review was generated by AI and reviewed by a human editor.