QUICK REVIEW

[Paper Review] In Defense of MinHash Over SimHash

Anshumali Shrivastava, Ping Li | arXiv (Cornell University) | Jul 16, 2014

Advanced Image and Video Retrieval Techniques | 27 references | 61 citations

TL;DR

This paper argues, theoretically and empirically, that MinHash outperforms SimHash for approximate near neighbor search on binary data, even though MinHash was designed for resemblance similarity and SimHash for cosine similarity. The authors prove that MinHash is a valid Locality-Sensitive Hashing scheme for cosine similarity using the bound $\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2 - \mathcal{S}}$, then show that it reaches the same recall while scanning far less data: on MNIST, 90% recall for top-1 retrieval requires scanning 0.6% of the data with MinHash versus 5% with SimHash. MinHash therefore comes out ahead even when both methods are evaluated on cosine similarity.

ABSTRACT

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity ($\mathcal{R}$), while the collision probability of SimHash is a function of cosine similarity ($\mathcal{S}$). To provide a common basis for comparison, we evaluate retrieval results in terms of $\mathcal{S}$ for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to $\mathcal{S}$, by using a general inequality $\mathcal{S}^2\leq \mathcal{R}\leq \frac{\mathcal{S}}{2-\mathcal{S}}$. Our worst case analysis can show that MinHash significantly outperforms SimHash in high similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, often $\mathcal{R}\geq \frac{\mathcal{S}}{z-\mathcal{S}}$ holds where $z$ is only slightly larger than 2 (e.g., $z\leq 2.1$). Our restricted worst case analysis by assuming $\frac{\mathcal{S}}{z-\mathcal{S}}\leq \mathcal{R}\leq \frac{\mathcal{S}}{2-\mathcal{S}}$ shows that MinHash indeed significantly outperforms SimHash even in low similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.

Motivation & Objective

To resolve the longstanding question of whether MinHash or SimHash is preferable for approximate near neighbor search in large-scale binary data, common in web and search applications.
To establish a theoretical foundation for comparing MinHash (designed for resemblance) and SimHash (designed for cosine similarity) by proving MinHash is a valid LSH for cosine similarity.
To empirically evaluate and compare the retrieval performance of MinHash and SimHash under the same metric—cosine similarity—on both binarized and original real-valued data.
To demonstrate that MinHash's advantages persist even when evaluated under conditions favoring SimHash, such as using original real-valued data.

Proposed method

Derive and prove the inequality $\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2 - \mathcal{S}}$, which bounds resemblance similarity $\mathcal{R}$ in terms of cosine similarity $\mathcal{S}$, enabling direct comparison of MinHash and SimHash under the same metric.
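Use the restricted bound $\mathcal{R} \geq \frac{\mathcal{S}}{z^* - \mathcal{S}}$ under the assumption $z \leq z^*$, where $z = \sqrt{f_2/f_1} + \sqrt{f_1/f_2}$ and $f_1$, $f_2$ are the numbers of nonzeros in the two vectors, to analyze performance on practical data in which the two vectors have similar numbers of nonzeros (so $z$ stays close to 2).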
Implement and compare MinHash on binarized data and SimHash on original or binarized data using varying $K$ (number of hash functions per table) and $L$ (number of tables) to find optimal parameter settings.
Evaluate retrieval performance using cosine similarity on both binarized and original real-valued data, measuring recall at top-$k$ results and fraction of data scanned.
Conduct extensive experiments on six binarized datasets (MNIST, RCV1, etc.) and two original real-valued datasets to validate theoretical findings across diverse data regimes.
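The inequality in this list can be checked numerically. The following is a minimal Python sketch, not code from the paper: it represents each binary vector as a set of nonzero indices, takes $a$ as the size of the intersection, and uses the standard definitions $\mathcal{R} = a/(f_1 + f_2 - a)$ and $\mathcal{S} = a/\sqrt{f_1 f_2}$. Dividing the numerator and denominator of $\mathcal{R}$ by $\sqrt{f_1 f_2}$ gives the identity $\mathcal{R} = \mathcal{S}/(z - \mathcal{S})$, and since $z \geq 2$ the upper bound follows. The function names `similarities` and `check_bounds` are illustrative assumptions.

```python
import math

def similarities(x, y):
    """Return (R, S, z) for two non-empty sets of nonzero indices."""
    f1, f2 = len(x), len(y)          # numbers of nonzeros in each vector
    a = len(x & y)                   # size of the intersection
    R = a / (f1 + f2 - a)            # resemblance (Jaccard) similarity
    S = a / math.sqrt(f1 * f2)       # cosine similarity for binary data
    z = math.sqrt(f1 / f2) + math.sqrt(f2 / f1)
    return R, S, z

def check_bounds(x, y, tol=1e-12):
    """Check R = S/(z - S) and S^2 <= R <= S/(2 - S) for one pair."""
    R, S, z = similarities(x, y)
    identity = math.isclose(R, S / (z - S), rel_tol=1e-9, abs_tol=tol)
    lower = S * S <= R + tol
    upper = R <= S / (2 - S) + tol
    return identity, lower, upper

# Illustrative pair (made-up data, not from the paper's datasets).
x = {0, 1, 2, 3, 4, 5}
y = {3, 4, 5, 6}
print(similarities(x, y))
print(check_bounds(x, y))
```

Under these definitions the upper bound is attained when the two sets have the same size ($z = 2$), and the lower bound is attained when one set contains the other, which is one way to see why neither side can be tightened without extra assumptions such as $z \leq z^*$.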

Experimental results

Research questions

RQ1: Can MinHash be rigorously shown to be a valid Locality-Sensitive Hashing scheme for cosine similarity, despite being designed for resemblance similarity?
RQ2: How does MinHash compare to SimHash in terms of retrieval performance when both are evaluated using cosine similarity as the metric?
RQ3: Does MinHash maintain its superiority over SimHash in low-similarity regimes, where the theoretical advantage is less obvious?
RQ4: Does the performance advantage of MinHash persist when evaluated on original real-valued data, placing it at a disadvantage relative to SimHash?

Key findings

MinHash significantly outperforms SimHash in high similarity regions, with theoretical bounds showing the advantage is most pronounced when $\mathcal{S} \approx 1$.
On the MNIST dataset, MinHash achieves 90% recall for top-1 retrieval by scanning only 0.6% of the data, while SimHash requires scanning 5% under optimal parameter tuning.
Even in low-similarity regimes, MinHash outperforms SimHash in the experiments, partly because practical data often satisfy $\mathcal{R} \geq \frac{\mathcal{S}}{z - \mathcal{S}}$ with $z$ only slightly larger than 2 (e.g., $z \leq 2.1$), which raises the worst-case collision probability of MinHash above the general $\mathcal{S}^2$ lower bound.
When retrieval is evaluated by cosine similarity on the original real-valued data, MinHash applied to the binarized data still outperforms SimHash, indicating that the advantage survives binarization.
The theoretical bound $\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2 - \mathcal{S}}$ is tight and cannot be improved without additional assumptions, validating its use for comparison.
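The high- and low-similarity findings above can be illustrated with the usual LSH quality exponent $\rho = \log p_1 / \log p_2$, where smaller is better. The sketch below is an illustration under stated assumptions, not the paper's code: it uses the standard SimHash collision probability $1 - \cos^{-1}(\mathcal{S})/\pi$, and for MinHash it plugs in the worst-case values implied by the bound, namely $p_1 = \mathcal{S}_0^2$ (or $\mathcal{S}_0/(z - \mathcal{S}_0)$ under the restricted assumption) and $p_2 = c\mathcal{S}_0/(2 - c\mathcal{S}_0)$. The threshold $\mathcal{S}_0$, the ratio $c$, and all function names are hypothetical choices for this example. `retrieval_probability` is the textbook chance that a point with per-hash collision probability $p$ lands in at least one of $L$ tables built from $K$ hashes each.

```python
import math

def simhash_collision(s):
    """SimHash (sign random projection) collision probability at cosine s."""
    return 1.0 - math.acos(s) / math.pi

def rho_simhash(s0, c):
    """rho = log p1 / log p2 for SimHash with thresholds s0 and c * s0."""
    return math.log(simhash_collision(s0)) / math.log(simhash_collision(c * s0))

def rho_minhash(s0, c, z=None):
    """Worst-case rho for MinHash when similarity is measured by cosine.

    p1 is the lower bound on collision probability for pairs with S >= s0:
    s0**2 in general, or s0 / (z - s0) if the data are assumed to satisfy
    the restricted bound with that z. p2 is the upper bound for S <= c * s0.
    Intended for 0 < s0 < 1 and 0 < c < 1.
    """
    p1 = s0 * s0 if z is None else s0 / (z - s0)
    p2 = c * s0 / (2.0 - c * s0)
    return math.log(p1) / math.log(p2)

def retrieval_probability(p, K, L):
    """Probability of colliding in at least one of L tables of K hashes."""
    return 1.0 - (1.0 - p ** K) ** L

if __name__ == "__main__":
    c = 0.5  # assumed approximation ratio, chosen only for illustration
    for s0 in (0.9, 0.5, 0.3):
        print(s0, rho_minhash(s0, c), rho_minhash(s0, c, z=2.1), rho_simhash(s0, c))
```

With these illustrative settings, the worst-case MinHash exponent is smaller than SimHash's at high similarity, while at low similarity MinHash pulls ahead only once the restricted assumption on $z$ is used, which matches the pattern the findings describe.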

This review was created by AI and reviewed by human editors.