QUICK REVIEW

[Paper Review] In Defense of MinHash Over SimHash

Anshumali Shrivastava, Ping Li|arXiv (Cornell University)|2014. 07. 16.

Advanced Image and Video Retrieval Techniques|References 27|Citations 61

One-Line Summary

This paper shows, both theoretically and experimentally, that MinHash outperforms SimHash for approximate near neighbor search on binary data. MinHash was designed for resemblance similarity $\mathcal{R}$ and SimHash for cosine similarity $\mathcal{S}$, but the authors prove that MinHash is also a valid Locality-Sensitive Hashing (LSH) scheme with respect to cosine similarity, using the bound $\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2 - \mathcal{S}}$. As a result, MinHash reaches high recall while scanning far less of the data: on MNIST, for example, MinHash reaches 90% recall after scanning 0.6% of the data, whereas SimHash needs 5%. MinHash therefore performs better even when retrieval is judged by cosine similarity, a criterion that favors SimHash.

ABSTRACT

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity ($\mathcal{R}$), while the collision probability of SimHash is a function of cosine similarity ($\mathcal{S}$). To provide a common basis for comparison, we evaluate retrieval results in terms of $\mathcal{S}$ for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to $\mathcal{S}$, by using a general inequality $\mathcal{S}^2\leq \mathcal{R}\leq \frac{\mathcal{S}}{2-\mathcal{S}}$. Our worst case analysis can show that MinHash significantly outperforms SimHash in high similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, often $\mathcal{R}\geq \frac{\mathcal{S}}{z-\mathcal{S}}$ holds where $z$ is only slightly larger than 2 (e.g., $z\leq 2.1$). Our restricted worst case analysis by assuming $\frac{\mathcal{S}}{z-\mathcal{S}}\leq \mathcal{R}\leq \frac{\mathcal{S}}{2-\mathcal{S}}$ shows that MinHash indeed significantly outperforms SimHash even in low similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.
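The general inequality quoted in the abstract follows from a short calculation for binary vectors. The derivation below is a sketch, not the authors' proof. It assumes the notation $a$ for the number of positions where both vectors are nonzero and $f_1$, $f_2$ for the numbers of nonzeros in each vector, with $a \leq \min(f_1, f_2)$.

$$
\mathcal{R} = \frac{a}{f_1 + f_2 - a}, \qquad
\mathcal{S} = \frac{a}{\sqrt{f_1 f_2}}
$$

Dividing the numerator and denominator of $\mathcal{R}$ by $\sqrt{f_1 f_2}$ gives

$$
\mathcal{R} = \frac{\mathcal{S}}{z - \mathcal{S}}, \qquad
z = \sqrt{\frac{f_1}{f_2}} + \sqrt{\frac{f_2}{f_1}} \geq 2,
$$

so $\mathcal{R} \leq \frac{\mathcal{S}}{2 - \mathcal{S}}$, with equality when $f_1 = f_2$. For the lower bound,

$$
\mathcal{S}^2 \leq \mathcal{R}
\iff a\,(f_1 + f_2 - a) \leq f_1 f_2
\iff (f_1 - a)(f_2 - a) \geq 0,
$$

which always holds because $a \leq \min(f_1, f_2)$.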

Research Motivation and Objectives

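To settle the long-standing question of whether MinHash or SimHash is preferable for approximate near neighbor search on large-scale binary data, which is common in web and search applications.
To establish a theoretical basis for comparing MinHash and SimHash by proving that MinHash, although designed for resemblance similarity, is a valid LSH for cosine similarity, the measure SimHash was designed for.
To experimentally evaluate and compare the retrieval performance of MinHash and SimHash on both binarized data and original real-valued data, using cosine similarity as the common metric.
To show that the superiority of MinHash holds even in a setting that favors SimHash, namely when the original real-valued data are used.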

Proposed Method

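Derive and prove the inequality $\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2 - \mathcal{S}}$ relating cosine similarity $\mathcal{S}$ and resemblance similarity $\mathcal{R}$, so that MinHash and SimHash can be compared directly under the same metric.
To analyze performance on practical data in which the sizes of the two sets do not differ too much, use the bound $\mathcal{R} \geq \frac{\mathcal{S}}{z^* - \mathcal{S}}$ under the assumption $z \leq z^*$, where $z = \sqrt{f_2/f_1} + \sqrt{f_1/f_2}$ and $f_1$, $f_2$ are the numbers of nonzeros in the two vectors.
Implement MinHash on binarized data and SimHash on original or binarized data with various values of $K$ (number of hash functions per table) and $L$ (number of tables), and find the best parameter settings.
Evaluate retrieval performance with cosine similarity on both binarized and original real-valued data, measuring top-$k$ recall and the fraction of data scanned.
Run extensive experiments on six binarized datasets (MNIST, RCV1, and others) and two original real-valued datasets to validate the theoretical results across different data settings.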
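The quantities used in the steps above can be checked numerically on a toy example. The following is a minimal sketch, not the authors' implementation: the function names, the set-based representation of binary vectors, and the use of explicit random permutations and Gaussian projections are assumptions made for illustration. It computes $\mathcal{R}$, $\mathcal{S}$, and $z$ for two sets of nonzero indices and estimates the collision rates of a single MinHash and a single SimHash bit, which should approach $\mathcal{R}$ and $1 - \arccos(\mathcal{S})/\pi$ respectively.

```python
import math
import random


def resemblance_and_cosine(x, y):
    """Return (R, S, z) for two non-empty sets of nonzero indices."""
    a = len(x & y)               # number of shared nonzeros
    f1, f2 = len(x), len(y)      # numbers of nonzeros in each vector
    r = a / (f1 + f2 - a)        # resemblance (Jaccard) similarity
    s = a / math.sqrt(f1 * f2)   # cosine similarity of binary vectors
    z = math.sqrt(f2 / f1) + math.sqrt(f1 / f2)
    return r, s, z


def minhash_collision_rate(x, y, dim, trials, rng):
    """Fraction of random permutations under which x and y share a minimum."""
    ranks = list(range(dim))
    hits = 0
    for _ in range(trials):
        rng.shuffle(ranks)  # ranks[i] is the permuted position of index i
        if min(ranks[i] for i in x) == min(ranks[i] for i in y):
            hits += 1
    return hits / trials


def simhash_collision_rate(x, y, dim, trials, rng):
    """Fraction of random Gaussian projections giving x and y the same sign."""
    hits = 0
    for _ in range(trials):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sign_x = sum(w[i] for i in x) >= 0
        sign_y = sum(w[i] for i in y) >= 0
        hits += sign_x == sign_y
    return hits / trials


if __name__ == "__main__":
    # Hypothetical toy data: two overlapping index sets in a 100-dim space.
    x, y = set(range(0, 60)), set(range(30, 100))
    r, s, z = resemblance_and_cosine(x, y)
    rng = random.Random(0)
    print("R, S, z:", r, s, z)
    print("bounds hold:", s ** 2 <= r <= s / (2 - s))
    print("MinHash rate:", minhash_collision_rate(x, y, 100, 2000, rng))
    print("SimHash rate:", simhash_collision_rate(x, y, 100, 2000, rng))
```

A full retrieval system would concatenate $K$ such hashes per table and use $L$ tables; that bucketing layer is omitted here to keep the sketch short.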

Experimental Results

Research Questions

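RQ1: Can it be rigorously proven that MinHash, although designed for resemblance similarity, is a valid Locality-Sensitive Hashing (LSH) scheme for cosine similarity?
RQ2: How does the retrieval performance of MinHash compare with that of SimHash when cosine similarity is the metric?
RQ3: Does MinHash still outperform SimHash in the low-similarity region, where its theoretical advantage is less pronounced?
RQ4: When evaluation uses the original real-valued data, which puts MinHash at a disadvantage relative to SimHash, does MinHash still keep its performance lead?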

Key Findings

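In the high-similarity region MinHash clearly outperforms SimHash, and according to the theoretical bounds the advantage is most pronounced when $\mathcal{S} \approx 1$.
On MNIST, MinHash reaches 90% recall for top-1 retrieval after scanning 0.6% of the data, whereas SimHash needs to scan 5% even at its best parameter setting.
MinHash also outperforms SimHash in the low-similarity region, because practical data tend to satisfy $\mathcal{R} \geq \frac{\mathcal{S}}{z - \mathcal{S}}$ with $z \leq 2.1$.
When evaluated on the original real-valued data, MinHash still outperforms SimHash even though it is applied to the binarized data. This suggests that MinHash is robust and generally superior.
The theoretical bound $\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2 - \mathcal{S}}$ is tight: it cannot be improved without additional assumptions, which makes it a suitable basis for the comparison.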
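The high- and low-similarity findings above can be illustrated with the standard LSH exponent $\rho = \log p_1 / \log p_2$, where a smaller $\rho$ means faster queries. The sketch below is an illustration under stated assumptions, not the paper's exact analysis: it takes $1 - \arccos(\mathcal{S})/\pi$ as the SimHash collision probability, uses the bounds on $\mathcal{R}$ to get worst-case MinHash probabilities, and, in the restricted case, takes the tighter of the two lower bounds on $\mathcal{R}$ (a choice made here). The function names and the example values of `s0`, `c`, and `z` are hypothetical.

```python
import math


def rho_simhash(s0, c):
    """LSH exponent for SimHash with threshold s0 and approximation ratio c < 1."""
    p1 = 1 - math.acos(s0) / math.pi       # collision prob. when S >= s0
    p2 = 1 - math.acos(c * s0) / math.pi   # collision prob. when S <= c * s0
    return math.log(p1) / math.log(p2)


def rho_minhash_worst_case(s0, c):
    """Worst-case exponent for MinHash using S^2 <= R <= S / (2 - S)."""
    p1 = s0 ** 2                   # lower bound on R when S >= s0
    p2 = c * s0 / (2 - c * s0)     # upper bound on R when S <= c * s0
    return math.log(p1) / math.log(p2)


def rho_minhash_restricted(s0, c, z):
    """Exponent for MinHash when R >= S / (z - S) is also assumed to hold."""
    # Assumption of this sketch: use whichever lower bound on R is tighter.
    p1 = max(s0 ** 2, s0 / (z - s0))
    p2 = c * s0 / (2 - c * s0)
    return math.log(p1) / math.log(p2)


if __name__ == "__main__":
    for s0 in (0.9, 0.3):          # one high- and one low-similarity threshold
        c, z = 0.5, 2.1
        print(
            f"s0={s0}: SimHash {rho_simhash(s0, c):.3f}, "
            f"MinHash worst case {rho_minhash_worst_case(s0, c):.3f}, "
            f"MinHash restricted {rho_minhash_restricted(s0, c, z):.3f}"
        )
```

With these example values, the worst-case MinHash exponent is smaller than the SimHash exponent at the high threshold, while at the low threshold MinHash only comes out ahead once the restricted bound with $z \leq 2.1$ is assumed.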


This review was generated by AI and reviewed by a human editor.