QUICK REVIEW

[Paper Review] Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining

Sami Hanhijärvi, Markus Ojala|arXiv (Cornell University)|Jun 16, 2020

Data Mining Algorithms and Applications | References: 16 | Citations: 72

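Summary in Brief

This paper introduces probabilistic data randomization methods that preserve previously discovered patterns, enabling significance testing in iterative data mining. It shows that retaining earlier results in the null model changes the estimated significance of newly found patterns and structures.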

ABSTRACT

There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.

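Motivation and Objectives

Motivate the need to assess whether the result of one data mining method adds information beyond the earlier analyses.
Develop randomization-based null models that preserve previously discovered patterns or models.
Enable significance testing in iterative data mining by comparing the original results against randomized datasets that respect earlier findings.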

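Proposed Method

Define structural measures that summarize the result of a data mining task.
Use Metropolis sampling based on local swaps to generate random datasets that preserve the specified statistics.
Formulate exact (ExactRand) and soft (SoftRand) randomization problems for margins, clusterings, and itemset frequencies.
Compute empirical p-values by comparing the original structural measure against its distribution over the randomized datasets.
Address tractability by proving that exact itemset-margin preservation is hard and proposing SoftRand as a practical alternative.
Describe SoftRand algorithms that preserve margins, clustering structure, and itemset frequencies using swap-based MCMC.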
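
The swap-based randomization and empirical p-value steps listed above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: it assumes a binary data matrix, preserves only row and column margins, and runs on synthetic toy data. The names `swap_randomize`, `pair_support`, and `empirical_p_value` are hypothetical, the co-occurrence count of two columns stands in for a structural measure, and the +1 correction in the p-value is a common convention assumed here rather than taken from the paper.

```python
import numpy as np


def swap_randomize(data, n_attempts, rng):
    """Shuffle a 0/1 matrix with local swaps that keep row and column sums fixed."""
    m = data.copy()
    n_rows, n_cols = m.shape
    for _ in range(n_attempts):
        r1, r2 = rng.integers(n_rows, size=2)
        c1, c2 = rng.integers(n_cols, size=2)
        a = m[r1, c1]
        # Swap only on a 2x2 "checkerboard" ([[1, 0], [0, 1]] or its mirror).
        # Draws with r1 == r2 or c1 == c2 can never pass this test.
        if m[r2, c2] == a and m[r1, c2] == 1 - a and m[r2, c1] == 1 - a:
            m[r1, c1] = m[r2, c2] = 1 - a
            m[r1, c2] = m[r2, c1] = a
    return m


def pair_support(m, i=0, j=1):
    """Illustrative structural measure: rows in which columns i and j are both 1."""
    return int(np.sum((m[:, i] == 1) & (m[:, j] == 1)))


def empirical_p_value(original_stat, random_stats):
    """One-sided p-value: share of random statistics >= the original (+1 correction)."""
    random_stats = np.asarray(random_stats)
    return (np.sum(random_stats >= original_stat) + 1) / (len(random_stats) + 1)


rng = np.random.default_rng(0)
data = (rng.random((30, 8)) < 0.4).astype(int)  # toy data, not from the paper
data[:10, 1] = data[:10, 0]                     # plant some co-occurrence

samples = [swap_randomize(data, 1000, rng) for _ in range(99)]
observed = pair_support(data)
p = empirical_p_value(observed, [pair_support(s) for s in samples])
print(f"observed support = {observed}, empirical p = {p:.3f}")
```

A small p-value here would suggest that the co-occurrence of the two columns is not explained by the margins alone. The number of swap attempts and samples are arbitrary choices for the sketch; in practice they would need to be large enough for the chain to mix.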

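Experimental Results

Research Questions

RQ1: How can we determine whether a discovered pattern or cluster provides information beyond previously observed structure?
RQ2: In iterative mining, can significance be tested by generating randomized datasets that preserve known statistics (margins, cluster centers, itemset frequencies)?
RQ3: How does preserving earlier results in the null model affect the significance of newly discovered patterns and clusters?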

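Key Findings

Randomization that preserves earlier analyses can change empirical p-values; larger patterns can become non-significant once prior frequencies are taken into account.
Clustering results can appear significant when tested against margins alone, yet lose significance when itemset frequencies are also preserved.
The study shows that itemset-margin preservation is computationally hard in general, which motivates the SoftRand approach.
Metropolis-based SoftRand offers a practical way to approximately preserve itemset frequencies while keeping the computation tractable.
In experiments on real data, preserving earlier patterns often reveals dependencies between clusterings and itemset patterns, which affects the significance conclusions.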
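
The idea of approximately preserving itemset frequencies with a Metropolis sampler can be illustrated with a soft acceptance rule on top of margin-preserving swaps. The following Python sketch is an illustration under stated assumptions, not the paper's algorithm: the error term (sum of absolute support differences from the original data), the acceptance probability `exp(-weight * increase_in_error)`, the toy data, and all names (`itemset_support`, `soft_swap_randomize`, `weight`) are assumptions made for this example.

```python
import numpy as np


def itemset_support(m, itemset):
    """Number of rows in which every column listed in `itemset` is 1."""
    return int(m[:, itemset].all(axis=1).sum())


def is_checkerboard(m, r1, r2, c1, c2):
    """True if the 2x2 submatrix is [[1, 0], [0, 1]] or [[0, 1], [1, 0]]."""
    a = m[r1, c1]
    return m[r2, c2] == a and m[r1, c2] == 1 - a and m[r2, c1] == 1 - a


def flip(m, r1, r2, c1, c2):
    """Apply a margin-preserving swap in place; applying it twice undoes it."""
    a = m[r1, c1]
    m[r1, c1] = m[r2, c2] = 1 - a
    m[r1, c2] = m[r2, c1] = a


def soft_swap_randomize(data, itemsets, weight, n_attempts, rng):
    """Metropolis sampling: margins are kept exactly, itemset supports approximately."""
    m = data.copy()
    target = [itemset_support(data, s) for s in itemsets]

    def error(mat):
        # Assumed error measure: total absolute deviation from original supports.
        return sum(abs(itemset_support(mat, s) - t) for s, t in zip(itemsets, target))

    current = error(m)
    n_rows, n_cols = m.shape
    for _ in range(n_attempts):
        r1, r2 = rng.integers(n_rows, size=2)
        c1, c2 = rng.integers(n_cols, size=2)
        if not is_checkerboard(m, r1, r2, c1, c2):
            continue
        flip(m, r1, r2, c1, c2)
        proposed = error(m)
        # Always accept improvements; accept a worse state with decaying probability.
        if proposed <= current or rng.random() < np.exp(-weight * (proposed - current)):
            current = proposed
        else:
            flip(m, r1, r2, c1, c2)  # rejected: undo the swap
    return m


rng = np.random.default_rng(1)
data = (rng.random((30, 8)) < 0.4).astype(int)  # toy data, not from the paper
data[:10, 1] = data[:10, 0]                     # plant some co-occurrence

soft = soft_swap_randomize(data, [[0, 1]], weight=2.0, n_attempts=2000, rng=rng)
print("original support:", itemset_support(data, [0, 1]))
print("soft-randomized support:", itemset_support(soft, [0, 1]))
```

In this sketch a larger `weight` keeps the sampled supports closer to the original values, while `weight = 0` reduces to plain margin-preserving swap randomization.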

This review was generated by AI and checked by a human editor.