QUICK REVIEW

[Paper Review] Evaluation on Entity Matching in Recommender Systems

Zihan Huang, Rohan Surana|arXiv (Cornell University)|Jan 23, 2026
Topic Modeling | Cited by: 0
One-Sentence Summary

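This paper introduces Reddit-Amazon-EM, a large, manually annotated dataset for cross-dataset entity matching (EM) that aligns movie mentions from Reddit with Amazon movie entries, and benchmarks a range of EM methods on it. The results show that graph-based and LLM-enhanced methods outperform traditional baselines. The paper also analyzes how EM affects LLM-based conversational recommender systems.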

ABSTRACT

Entity matching is a crucial component in various recommender systems, including conversational recommender systems (CRS) and knowledge-based recommender systems. However, the lack of rigorous evaluation frameworks for cross-dataset entity matching impedes progress in areas such as LLM-driven conversational recommendations and knowledge-grounded dataset construction. In this paper, we introduce Reddit-Amazon-EM, a novel dataset comprising naturally occurring items from Reddit and the Amazon '23 dataset. Through careful manual annotation, we identify corresponding movies across Reddit-Movies and Amazon'23, two existing recommender system datasets with inherently overlapping catalogs. Leveraging Reddit-Amazon-EM, we conduct a comprehensive evaluation of state-of-the-art entity matching methods, including rule-based, graph-based, lexical-based, embedding-based, and LLM-based approaches. For reproducible research, we release our manually annotated entity matching gold set and provide the mapping between the two datasets using the best-performing method from our experiments. This serves as a valuable resource for advancing future work on entity matching in recommender systems. Data and Code are accessible at: https://github.com/huang-zihan/Reddit-Amazon-Entity-Matching.

Motivation and Objectives

  • Introduce Reddit-Amazon-EM, the largest publicly available knowledge-grounded EM dataset linking Reddit movie mentions to Amazon catalog entries.
  • Provide a rigorous evaluation of diverse EM methods (rule-based, lexical, embedding-based, graph-based, and LLM-based) on cross-dataset linking.
  • Analyze how EM quality affects downstream LLM-driven conversational recommendation systems.
  • Enable reproducible research by releasing annotated gold data and evaluation code.

Proposed Method

  • Construct Reddit-Amazon-EM by linking ~4k Amazon Movie entries to Reddit movie mentions with manual annotation.
  • Retrieve candidate Amazon entries for each Reddit title using title-based similarity and metadata filtering.
  • Use a Streamlit interface with supporting metadata and prompts (e.g., GPT-3.5) to manually confirm correct matches.
  • Evaluate multiple EM baselines: BM25, Faiss, Embedding+Fuzzy, GNEM, ComEM.
  • Assess performance with Recall@k, Precision@k, F1, and Accuracy, plus computational efficiency.
  • Provide an annotated gold set and dataset-candidate mappings for reproducibility.
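The candidate-retrieval and Recall@k steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it uses Python's standard-library `difflib.SequenceMatcher` as a stand-in for the title-similarity measure, omits metadata filtering, and runs on a toy catalog invented for the example. The helper names (`normalize`, `title_similarity`, `retrieve_candidates`, `recall_at_k`) are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not lower the score."""
    return " ".join(title.lower().split())


def title_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the paper's similarity measure."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def retrieve_candidates(mention: str, catalog: dict[str, str], k: int = 3) -> list[str]:
    """Return the ids of the k catalog titles most similar to the mention."""
    ranked = sorted(
        catalog,
        key=lambda item_id: title_similarity(mention, catalog[item_id]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(gold: dict[str, str], catalog: dict[str, str], k: int) -> float:
    """Fraction of mentions whose gold catalog id appears among the top-k candidates."""
    hits = sum(
        gold_id in retrieve_candidates(mention, catalog, k)
        for mention, gold_id in gold.items()
    )
    return hits / len(gold)


# Toy data, invented for illustration only (not taken from Reddit-Amazon-EM).
catalog = {
    "A1": "The Matrix (1999) [DVD]",
    "A2": "The Matrix Reloaded",
    "A3": "Inception (2010) [Blu-ray]",
    "A4": "Interstellar",
}
# Reddit-style mention -> manually confirmed catalog id (the "gold" label).
gold = {"the matrix": "A1", "inception": "A3", "interstelar": "A4"}

print(retrieve_candidates("the matrix", catalog, k=2))
print(recall_at_k(gold, catalog, k=2))
```

In a setup like this, the retrieved candidates would then be shown to an annotator for confirmation, and the confirmed pairs would form the gold set against which each EM method is scored.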

Experimental Results

Research Questions

  • RQ1 What is the effectiveness of different entity matching methods on cross-dataset movie linking between Reddit and Amazon?
  • RQ2 Which EM methods best handle noisy, in-the-wild movie titles and data heterogeneity?
  • RQ3 How does EM quality translate to performance in LLM-driven conversational recommender systems?
  • RQ4 What are the trade-offs between precision, recall, and efficiency across EM approaches?

Key Findings

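| Model        | Precision (%) | Recall@1 (%) | F1 score (%) | Accuracy (%) |
|--------------|---------------|--------------|--------------|--------------|
| Emb+Fuzzy    | 86.38 ±0.03   | 86.99 ±0.04  | 86.68 ±0.03  | 92.78 ±0.02  |
| BM25         | 74.93 ±0.04   | 82.30 ±0.04  | 78.43 ±0.03  | 89.71 ±0.02  |
| Faiss        | 60.51 ±0.04   | 89.76 ±0.03  | 72.28 ±0.04  | 91.83 ±0.02  |
| BM25 + Faiss | 74.54 ±0.04   | 84.49 ±0.04  | 79.20 ±0.03  | 90.75 ±0.02  |
| ComEM        | 94.50 ±0.04   | 93.97 ±0.04  | 94.02 ±0.04  | 94.70 ±0.04  |
| GNEM         | 95.82 ±0.02   | 96.78 ±0.02  | 96.29 ±0.01  | 96.74 ±0.01  |
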
  • GNEM (graph-based) achieves the highest performance (F1 96.29%, Accuracy 96.74%).
  • ComEM (LLM-enhanced) closely follows GNEM with F1 94.02% and Accuracy 94.70%.
  • Embedding+Fuzzy is stronger than traditional baselines but substantially behind GNEM and ComEM (F1 86.68%).
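  • BM25 and Faiss lag behind and trade precision against recall differently: Faiss reaches high Recall@1 (89.76%) at the lowest precision in the table (60.51%), while BM25 is more balanced (precision 74.93%, Recall@1 82.30%).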
  • Traditional methods are fast to initialize but slow at inference; GNEM and ComEM offer better overall efficiency after setup; Emb+Fuzzy balances speed with performance.
  • On CRS dialogue tasks, GNEM remains the top performer, with robustness to conversational variation, though gains over smaller LLMs diminish in dialogue settings.
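As a quick sanity check on the reported numbers, F1 can be recomputed from precision and recall. The sketch below assumes the F1 column is the usual harmonic mean of precision and Recall@1. Because the reported values appear to be averages with a spread (the ± figures), the recomputed value need not match the reported one exactly. The function name `f1_score` is chosen for this example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall@1, reported F1), copied from the results reported above.
reported = {
    "Emb+Fuzzy": (86.38, 86.99, 86.68),
    "BM25": (74.93, 82.30, 78.43),
    "Faiss": (60.51, 89.76, 72.28),
    "BM25 + Faiss": (74.54, 84.49, 79.20),
    "ComEM": (94.50, 93.97, 94.02),
    "GNEM": (95.82, 96.78, 96.29),
}

for model, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{model:12s} recomputed={f1:.2f} reported={f1_reported:.2f} diff={f1 - f1_reported:+.2f}")
```

The Faiss row shows the precision/recall trade-off most clearly: the harmonic mean is pulled toward the lower of the two values, so high recall does not compensate for low precision.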


This review was generated by AI and reviewed by human editors.