[Paper Summary] Evaluation on Entity Matching in Recommender Systems
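This paper introduces Reddit-Amazon-EM, a large, manually annotated dataset for cross-dataset entity matching (EM) that aligns movie mentions on Reddit with Amazon movie entries. It benchmarks a range of EM methods, finds that graph-based and LLM-enhanced methods outperform traditional baselines, and analyzes how EM affects LLM-based conversational recommender systems.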
Entity matching is a crucial component in various recommender systems, including conversational recommender systems (CRS) and knowledge-based recommender systems. However, the lack of rigorous evaluation frameworks for cross-dataset entity matching impedes progress in areas such as LLM-driven conversational recommendations and knowledge-grounded dataset construction. In this paper, we introduce Reddit-Amazon-EM, a novel dataset comprising naturally occurring items from Reddit and the Amazon'23 dataset. Through careful manual annotation, we identify corresponding movies across Reddit-Movies and Amazon'23, two existing recommender system datasets with inherently overlapping catalogs. Leveraging Reddit-Amazon-EM, we conduct a comprehensive evaluation of state-of-the-art entity matching methods, including rule-based, graph-based, lexical-based, embedding-based, and LLM-based approaches. For reproducible research, we release our manually annotated entity matching gold set and provide the mapping between the two datasets using the best-performing method from our experiments. This serves as a valuable resource for advancing future work on entity matching in recommender systems. Data and code are accessible at: https://github.com/huang-zihan/Reddit-Amazon-Entity-Matching.
Research Motivation and Goals
- Introduce Reddit-Amazon-EM, the largest publicly available knowledge-grounded EM dataset linking Reddit movie mentions to Amazon catalog entries.
- Provide a rigorous evaluation of diverse EM methods (rule-based, lexical, embedding-based, graph-based, and LLM-based) on cross-dataset linking.
- Analyze how EM quality affects downstream LLM-driven conversational recommendation systems.
- Enable reproducible research by releasing annotated gold data and evaluation code.
Proposed Method
- Construct Reddit-Amazon-EM by linking ~4k Amazon Movie entries to Reddit movie mentions with manual annotation.
- Retrieve candidate Amazon entries for each Reddit title using title-based similarity and metadata filtering.
- Use a Streamlit interface with supporting metadata and prompts (e.g., GPT-3.5) to manually confirm correct matches.
- Evaluate multiple EM baselines: BM25, Faiss, Embedding+Fuzzy, GNEM, ComEM.
- Assess performance with Recall@k, Precision@k, F1, and Accuracy, plus computational efficiency.
- Provide an annotated gold set and dataset-candidate mappings for reproducibility.
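The candidate-retrieval step listed above (title-based similarity plus metadata filtering) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the `year` field used as the metadata filter, the `min_score` threshold, and the use of Python's standard-library `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase a title and replace punctuation with spaces."""
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    return " ".join(cleaned.split())


def retrieve_candidates(mention, catalog, year=None, k=3, min_score=0.5):
    """Return up to k (score, item) pairs ranked by title similarity.

    `catalog` is a list of dicts with hypothetical keys "id", "title", "year".
    """
    query = normalize(mention)
    scored = []
    for item in catalog:
        # Metadata filter (assumed): drop entries whose known year disagrees.
        if year is not None and item.get("year") is not None and item["year"] != year:
            continue
        score = SequenceMatcher(None, query, normalize(item["title"])).ratio()
        if score >= min_score:
            scored.append((score, item))
    # Sort on the score only, so ties never fall back to comparing dicts.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy catalog standing in for Amazon movie entries (made-up data).
catalog = [
    {"id": "A1", "title": "The Matrix", "year": 1999},
    {"id": "A2", "title": "The Matrix Reloaded", "year": 2003},
    {"id": "A3", "title": "Inception", "year": 2010},
]

with_year = retrieve_candidates("the matrix!!", catalog, year=1999)
without_year = retrieve_candidates("matrix reloaded", catalog)
```

In a workflow like the one described, the short candidate list returned here would then be shown to an annotator for confirmation rather than accepted automatically.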
Experimental Results
Research Questions
- RQ1 What is the effectiveness of different entity matching methods on cross-dataset movie linking between Reddit and Amazon?
- RQ2 Which EM methods best handle noisy, in-the-wild movie titles and data heterogeneity?
- RQ3 How does EM quality translate to performance in LLM-driven conversational recommender systems?
- RQ4 What are the trade-offs between precision, recall, and efficiency across EM approaches?
Key Findings
| Model | Precision (%) | Recall@1 (%) | F1 (%) | Accuracy (%) |
|---|---|---|---|---|
| Emb+Fuzzy | 86.38 ±0.03 | 86.99 ±0.04 | 86.68 ±0.03 | 92.78 ±0.02 |
| BM25 | 74.93 ±0.04 | 82.30 ±0.04 | 78.43 ±0.03 | 89.71 ±0.02 |
| Faiss | 60.51 ±0.04 | 89.76 ±0.03 | 72.28 ±0.04 | 91.83 ±0.02 |
| BM25 + Faiss | 74.54 ±0.04 | 84.49 ±0.04 | 79.20 ±0.03 | 90.75 ±0.02 |
| ComEM | 94.50 ±0.04 | 93.97 ±0.04 | 94.02 ±0.04 | 94.70 ±0.04 |
| GNEM | 95.82 ±0.02 | 96.78 ±0.02 | 96.29 ±0.01 | 96.74 ±0.01 |
- GNEM (graph-based) achieves the highest performance (F1 96.29%, Accuracy 96.74%).
- ComEM (LLM-enhanced) closely follows GNEM with F1 94.02% and Accuracy 94.70%.
- Embedding+Fuzzy is stronger than traditional baselines but substantially behind GNEM and ComEM (F1 86.68%).
- BM25 and Faiss lag behind, each with a different precision/recall trade-off: Faiss reaches 89.76% Recall@1 but only 60.51% precision, while BM25 is more balanced (74.93% precision, 82.30% Recall@1).
- Traditional methods are fast to initialize but slow at inference; GNEM and ComEM offer better overall efficiency after setup; Emb+Fuzzy balances speed with performance.
- On CRS dialogue tasks, GNEM remains the top performer, with robustness to conversational variation, though gains over smaller LLMs diminish in dialogue settings.
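To make the precision/recall trade-off in the table concrete, the sketch below computes the four reported metrics on a toy gold set. The metric definitions used here (precision over non-abstaining predictions, recall over matchable mentions, accuracy over all mentions including "no match" cases) are one common convention and are an assumption; the paper's exact definitions may differ. All identifiers and data are made up for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(gold: dict, pred: dict) -> dict:
    """Score top-1 predictions against a gold mapping.

    gold: mention -> catalog id, or None when no matching entry exists.
    pred: mention -> predicted id, or None when the system abstains.
    """
    # True positives: a matchable mention linked to the correct entry.
    tp = sum(1 for m, g in gold.items() if g is not None and pred.get(m) == g)
    predicted = sum(1 for m in gold if pred.get(m) is not None)
    matchable = sum(1 for g in gold.values() if g is not None)
    # Accuracy also rewards correctly abstaining on unmatchable mentions.
    correct = sum(1 for m, g in gold.items() if pred.get(m) == g)

    precision = tp / predicted if predicted else 0.0
    recall = tp / matchable if matchable else 0.0
    return {
        "precision": precision,
        "recall@1": recall,
        "f1": f1_from(precision, recall),
        "accuracy": correct / len(gold) if gold else 0.0,
    }


# Toy example: one correct link, one wrong link, one correct abstention,
# and one missed match.
gold = {"m1": "A1", "m2": "A2", "m3": None, "m4": "A4"}
pred = {"m1": "A1", "m2": "A9", "m3": None, "m4": None}
scores = evaluate(gold, pred)

# Harmonic mean of the Faiss row's precision and Recall@1 from the table.
faiss_f1 = f1_from(60.51, 89.76)
```

Applying `f1_from` to the Faiss row gives roughly 72.29, in line with the reported 72.28: because F1 is a harmonic mean, high recall cannot compensate for low precision. Reported F1 values are averages over runs, so they need not equal the harmonic mean of the averaged precision and recall exactly.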
This summary was generated by AI and reviewed by human editors.