QUICK REVIEW

[Paper Review] Clustering Via Crowdsourcing

Arya Mazumdar, Barna Saha|arXiv (Cornell University)|Apr 7, 2016

Mobile Crowdsensing and Crowdsourcing | References: 37 | Citations: 19

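One-Line Summary

This paper proposes an adaptive, parallelizable clustering method that substantially reduces query complexity by exploiting noisy similarity side information while tolerating erroneous crowd answers. By combining random sampling, majority voting, and iterative cluster growth, it achieves query complexity that is linear or even sublinear in n, together with near-optimal round complexity, even under noisy conditions.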

ABSTRACT

In recent years, crowdsourcing, aka human aided computation has emerged as an effective platform for solving problems that are considered complex for machines alone. Using human is time-consuming and costly due to monetary compensations. Therefore, a crowd based algorithm must judiciously use any information computed through an automated process, and ask minimum number of questions to the crowd adaptively. One such problem which has received significant attention is entity resolution. Formally, we are given a graph $G=(V,E)$ with unknown edge set $E$ where $G$ is a union of $k$ (again unknown, but typically large $O(n^\alpha)$, for $\alpha>0$) disjoint cliques $G_i(V_i, E_i)$, $i =1, \dots, k$. The goal is to retrieve the sets $V_i$s by making minimum number of pair-wise queries $V \times V \to \{\pm 1\}$ to an oracle (the crowd). When the answer to each query is correct, e.g. via resampling, then this reduces to finding connected components in a graph. On the other hand, when crowd answers may be incorrect, it corresponds to clustering over minimum number of noisy inputs. Even, with perfect answers, a simple lower and upper bound of $\Theta(nk)$ on query complexity can be shown. A major contribution of this paper is to reduce the query complexity to linear or even sublinear in $n$ when mild side information is provided by a machine, and even in presence of crowd errors which are not correctable via resampling. We develop new information theoretic lower bounds on the query complexity of clustering with side information and errors, and our upper bounds closely match with them. Our algorithms are naturally parallelizable, and also give near-optimal bounds on the number of adaptive rounds required to match the query complexity.
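To make the $\Theta(nk)$ baseline in the abstract concrete, here is a minimal sketch of one simple strategy for the perfect-oracle case: compare each new vertex against a single representative of every cluster found so far. This is an illustration and not the paper's algorithm; the function name `cluster_with_perfect_oracle`, the `same_cluster` callback, and the toy `labels` list are hypothetical.

```python
from typing import Callable, List, Tuple


def cluster_with_perfect_oracle(
    n: int, same_cluster: Callable[[int, int], bool]
) -> Tuple[List[List[int]], int]:
    """Cluster vertices 0..n-1 with an error-free pairwise oracle.

    Each vertex is compared with one representative per known cluster,
    so at most k queries are spent per vertex (at most n*k in total).
    """
    clusters: List[List[int]] = []
    queries = 0
    for v in range(n):
        for members in clusters:
            queries += 1
            if same_cluster(v, members[0]):  # members[0] is the representative
                members.append(v)
                break
        else:
            # No existing cluster matched, so v starts a new one.
            clusters.append([v])
    return clusters, queries


# Hypothetical ground truth used only to simulate the oracle.
labels = [0, 1, 0, 2, 1, 0]
clusters, queries = cluster_with_perfect_oracle(
    len(labels), lambda i, j: labels[i] == labels[j]
)
```

On this toy input the sketch recovers three clusters with 7 queries, well under the worst case of n·k = 18. Side information matters because it lets an algorithm skip most of these representative comparisons.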

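Research Motivation and Objectives

To reduce the high query cost of crowdsourced entity resolution by minimizing the number of pairwise queries posed to humans.
To overcome the Ω(nk) query complexity, the theoretical bottleneck of standard connected-component recovery, by incorporating machine-generated similarity side information.
To design algorithms that tolerate noisy crowd answers with error probability 1/2−λ without relying on correction via resampling.
To achieve near-optimal round complexity in a parallel execution model, which is essential for scalable, real-time applications.
To present information-theoretic lower bounds and match them with upper bounds, establishing tight theoretical guarantees.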

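Proposed Method

Uses as side information a noisy weighted similarity matrix W, where w_{i,j} is drawn from f_+ when i and j belong to the same cluster and from f_- when they belong to different clusters. The distributions f_+ and f_- are unknown.
Samples √(n log n) vertices uniformly at random and issues all pairwise queries among them, building a subgraph G'' for initial cluster detection.
Extracts the maximum-weight subgraph S from G'', weighting positive answers +1 and negative answers -1. Expands S by majority vote, using c log n queries for each unclustered vertex.
Repeats cluster growth round by round, maintaining O(1) round complexity per cluster when the cluster size is at least c log n. Each growth phase takes c rounds.
Applies recursive clustering: after the initial sampling, the procedure recurses on the still-unclustered vertices, recovering all clusters with high probability.
Bounds the query and round complexity using information-theoretic analysis. The constant c = O(1/λ²) controls the error tolerance.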
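The majority-vote expansion step can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: the seed cluster is taken as given (the maximum-weight subgraph extraction is skipped), the constant is fixed at c = 1/λ², and the names `num_votes`, `grow_cluster`, `faulty_oracle`, `labels`, and `lies` are hypothetical. Each candidate is queried against distinct seed members, so the votes come from different pairs and are not re-asks of one question.

```python
import math
from typing import Callable, List


def num_votes(n: int, lam: float) -> int:
    """Votes per vertex, c * log n, with the assumed constant c = 1 / lam**2."""
    return max(1, math.ceil(math.log(n) / lam ** 2))


def grow_cluster(
    seed: List[int],
    candidates: List[int],
    oracle: Callable[[int, int], int],
    votes: int,
) -> List[int]:
    """Add a candidate when a strict majority of seed members answer +1.

    `oracle(v, u)` returns +1 (same cluster) or -1 (different cluster) and
    may be wrong on some pairs. A tie counts as a rejection.
    """
    voters = seed[:votes]  # distinct seed members, one query each
    grown = list(seed)
    for v in candidates:
        total = sum(oracle(v, u) for u in voters)
        if total > 0:
            grown.append(v)
    return grown


# Toy ground truth and a fixed set of pairs on which the oracle lies.
labels = [0, 0, 0, 0, 0, 1, 1, 0, 1]
lies = {(7, 0), (5, 1), (8, 3), (8, 4)}


def faulty_oracle(i: int, j: int) -> int:
    truth = 1 if labels[i] == labels[j] else -1
    return -truth if (i, j) in lies else truth


grown = grow_cluster([0, 1, 2, 3, 4], [5, 6, 7, 8], faulty_oracle, votes=5)
```

In the toy run, vertex 7 is admitted despite one wrong answer and vertex 8 is rejected despite two, because fewer than half of the votes are corrupted in each case. For realistic sizes, `num_votes(n, lam)` gives the number of voters under the assumed constant, for example 19 votes when n = 100 and λ = 0.5.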

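Experimental Results

Research Questions

RQ1: When side information is available, can the query complexity of crowdsourced clustering be brought below Θ(nk)?
RQ2: How do noisy crowd answers with error probability 1/2−λ affect the minimum number of queries required?
RQ3: What is the optimal trade-off between the number of queries and round complexity in adaptive, parallel crowdsourcing algorithms?
RQ4: Even when f_+ and f_- are unknown, is there an algorithm that achieves sublinear query complexity given side information?
RQ5: What are the fundamental limits (lower bounds) on query complexity for clustering with noisy inputs and side information?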

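Key Findings

The paper notes that, even with perfect answers, the query complexity without side information is Θ(nk) (matching lower and upper bounds), and shows that side information reduces it to linear or even sublinear in n.
With side information and a perfect oracle, when k = Ω(√n) or k = O(√n / Δ(f_+||f_-)), the round complexity is within an Õ(1) factor of optimal.
With a faulty oracle (error probability 1/2−λ), even without side information, the round complexity is within an Õ(√log n) factor of optimal.
When all (n choose 2) pairwise queries are used, the maximum-likelihood estimate of the true cluster structure is recovered with high probability.
The theoretical analysis shows that the query complexity is tightly bounded by information-theoretic limits, with a gap between the upper and lower bounds of at most O(√(n log n)/k).
The method is naturally parallelizable; each cluster growth phase completes in O(1) rounds, enabling efficient distributed execution.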
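The faulty-oracle results above rely on majority voting over answers to distinct pairs. As a rough guide to why on the order of log n / λ² votes per vertex suffice, the following is the standard Hoeffding calculation. It is a textbook sketch and not reproduced from the paper; independence of the answers and the specific choice m = (ln n)/λ² are assumptions made for illustration.

```latex
% X_1, ..., X_m in {-1, +1}: independent oracle answers about a true cluster
% member, each correct (+1) with probability 1/2 + \lambda.
\begin{align*}
  \mathbb{E}[X_t] &= \left(\tfrac12 + \lambda\right) - \left(\tfrac12 - \lambda\right) = 2\lambda, \\
  \Pr\!\left[\sum_{t=1}^{m} X_t \le 0\right]
    &= \Pr\!\left[\sum_{t=1}^{m} \bigl(X_t - \mathbb{E}[X_t]\bigr) \le -2\lambda m\right] \\
    &\le \exp\!\left(-\frac{2\,(2\lambda m)^2}{m \cdot 2^2}\right)
     = \exp\!\left(-2\lambda^2 m\right). \\
  \text{With } m = \frac{\ln n}{\lambda^2}: \quad
  \Pr[\text{majority wrong}] &\le n^{-2},
  \qquad
  \Pr[\text{any of } n \text{ vertices misplaced}] \le n \cdot n^{-2} = \frac{1}{n}.
\end{align*}
```

The same bound applies symmetrically to a non-member, whose answers have mean -2λ. Asking the same pair repeatedly would not help here, since the abstract states that the errors are not correctable via resampling; the votes must come from different cluster members.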

This review was generated by AI and checked by a human editor.