[Paper Review] Using Word Embeddings for Automatic Query Expansion
This paper proposes a query expansion method using word2vec embeddings to improve ad-hoc information retrieval by retrieving semantically related terms via k-nearest neighbors in the embedding space. Despite outperforming baseline methods, the approach falls significantly short of RM3, a statistical feedback-based method, indicating that semantic similarity from word embeddings alone is less effective than co-occurrence statistics for query expansion.
In this paper, a framework for Automatic Query Expansion (AQE) is proposed using the distributed neural language model word2vec. Using semantic and contextual relations in a distributed and unsupervised framework, word2vec learns a low-dimensional embedding for each vocabulary entry. Using such a framework, we devise a query expansion technique in which terms related to a query are obtained by a K-nearest neighbor approach. We explore the performance of the AQE methods, with and without feedback query expansion, and a variant of simple K-nearest neighbor in the proposed framework. Experiments on standard TREC ad-hoc data (Disks 4 and 5 with query sets 301-450 and 601-700) and web data (WT10G with query set 451-550) show significant improvement over standard term-overlap-based retrieval methods. However, the proposed method fails to achieve performance comparable to a statistical co-occurrence-based feedback method such as RM3. We have also found that the word2vec-based query expansion methods perform similarly with and without feedback information.
Motivation & Objective
- To investigate whether word embeddings can enhance automatic query expansion (AQE) in ad-hoc retrieval.
- To evaluate the effectiveness of k-nearest neighbor (kNN) expansion using word2vec embeddings, both with and without relevance feedback.
- To compare embedding-based AQE methods against established feedback-based techniques like RM3.
- To analyze whether embedding-based expansion performs consistently across different query types.
- To explore the potential of combining word embeddings with co-occurrence statistics for improved AQE.
Proposed method
- Word2vec is used to generate dense, low-dimensional vector representations for all words in the vocabulary, capturing semantic and syntactic relationships.
- For query expansion, the k-nearest neighbors (kNN) of each query term are retrieved in the embedding space using cosine similarity.
- Candidate expansion terms are selected based on their average cosine similarity to all query terms, forming an expanded query set.
- Three variants are evaluated: pre-retrieval kNN (no feedback), post-retrieval kNN (feedback-based search space), and incremental kNN (iterative refinement).
- The incremental method computes neighbors iteratively, pruning the search space based on relevance feedback, improving efficiency and focus.
- Retrieval effectiveness is evaluated using the standard metrics MAP and P@10 on the TREC ad-hoc (Disks 4 and 5) and WT10G web datasets.
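The kNN expansion step described in the list above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`normalize`, `expand_query`), the parameters `k` and `n_expansion`, and the tiny three-dimensional vectors are all hypothetical stand-ins for real word2vec embeddings. The sketch collects the k nearest neighbors of each query term by cosine similarity, then ranks those candidates by their average cosine similarity to all query terms.

```python
import numpy as np

def normalize(matrix):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

def expand_query(query_terms, vocab, vectors, k=2, n_expansion=2):
    """Hypothetical helper: pick expansion terms for a query.

    Step 1: gather the k nearest neighbors of every query term.
    Step 2: rank the gathered candidates by mean cosine similarity
            to all query terms and keep the top n_expansion.
    """
    index = {word: i for i, word in enumerate(vocab)}
    unit = normalize(np.asarray(vectors, dtype=float))
    q_idx = [index[t] for t in query_terms if t in index]
    if not q_idx:
        return []  # no query term has an embedding
    q_set = set(q_idx)

    # sims[i, j] = cosine similarity between vocab word i and query term j
    sims = unit @ unit[q_idx].T

    candidates = set()
    for col in range(sims.shape[1]):
        ranked = [i for i in np.argsort(-sims[:, col]).tolist() if i not in q_set]
        candidates.update(ranked[:k])

    scored = [(vocab[i], float(sims[i].mean())) for i in candidates]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:n_expansion]

# Toy vocabulary with made-up vectors (assumption: real use would load word2vec).
VOCAB = ["car", "automobile", "vehicle", "banana", "engine"]
VECTORS = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.1],
    [0.7, 0.3, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.0, 0.8],
]

if __name__ == "__main__":
    for term, score in expand_query(["car", "engine"], VOCAB, VECTORS):
        print(f"{term}\t{score:.3f}")
```

In this sketch the expansion terms are only selected, not weighted; how the selected terms are combined with the original query for retrieval is left out because the review does not specify it.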
Experimental results
Research questions
- RQ1: Does query expansion using kNN of word2vec embeddings improve retrieval effectiveness compared to baseline methods?
- RQ2: Can the performance of embedding-based query expansion be improved by incorporating relevance feedback?
- RQ3: How does the performance of word2vec-based AQE compare to the established RM3 feedback method?
- RQ4: Are there specific query types for which embedding-based expansion works better or worse?
- RQ5: Can the combination of word embeddings and co-occurrence statistics further enhance AQE performance?
Key findings
- The proposed word2vec-based query expansion methods significantly improve retrieval performance over the unexpanded baseline on both TREC ad-hoc and WT10G web datasets.
- The pre-retrieval and post-retrieval kNN methods perform similarly, with no statistically significant difference, indicating that feedback does not enhance the embedding-based similarity measure.
- The incremental kNN method achieves the best performance among embedding-based approaches, with a MAP of 0.2956 on query set 451-550 (WT10G), significantly outperforming the baseline.
- Despite improvements, all embedding-based methods are substantially outperformed by RM3, which achieves a MAP of 0.3304 on the same dataset, indicating that co-occurrence statistics are more effective than semantic similarity alone.
- The incremental method is generally safe, improving performance on most queries and harming only a few, as shown in query-by-query analysis.
- The study finds that word2vec embeddings alone fail to capture the co-occurrence patterns critical for effective query expansion, which explains the performance gap with RM3.
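The MAP and P@10 figures cited in the findings follow the standard ranked-retrieval definitions. As a reminder of what those numbers measure, here is a minimal sketch; the function names and the toy ranking are hypothetical, and the paper's actual evaluation tooling is not described in this review.

```python
def precision_at_k(ranked_docs, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked_docs[:k]
    return sum(1 for doc in top if doc in relevant) / k

def average_precision(ranked_docs, relevant):
    """Mean of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs, qrels):
    """Average of per-query average precision.

    runs:  dict mapping query id -> ranked list of document ids
    qrels: dict mapping query id -> set of relevant document ids
    """
    if not runs:
        return 0.0
    scores = [average_precision(docs, qrels.get(qid, set())) for qid, docs in runs.items()]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Toy example with made-up document ids.
    runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d9", "d8"]}
    qrels = {"q1": {"d1", "d3"}, "q2": {"d8"}}
    print(mean_average_precision(runs, qrels))
    print(precision_at_k(runs["q1"], qrels["q1"], k=2))
```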
This review was created by AI and reviewed by human editors.