QUICK REVIEW

[Paper Review] Evaluation on Entity Matching in Recommender Systems

Zihan Huang, Rohan Surana | arXiv (Cornell University) | Jan 23, 2026

Topic Modeling | Cited by 0

One-Sentence Summary

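This paper introduces Reddit-Amazon-EM, a large manually annotated dataset for cross-dataset entity matching (EM) that links movie mentions from Reddit to Amazon movie entries, and benchmarks a range of EM methods on it. Graph-based and LLM-enhanced methods outperform traditional baselines. The paper also analyzes how EM affects LLM-based conversational recommender systems.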

ABSTRACT

Entity matching is a crucial component in various recommender systems, including conversational recommender systems (CRS) and knowledge-based recommender systems. However, the lack of rigorous evaluation frameworks for cross-dataset entity matching impedes progress in areas such as LLM-driven conversational recommendations and knowledge-grounded dataset construction. In this paper, we introduce Reddit-Amazon-EM, a novel dataset comprising naturally occurring items from Reddit and the Amazon '23 dataset. Through careful manual annotation, we identify corresponding movies across Reddit-Movies and Amazon'23, two existing recommender system datasets with inherently overlapping catalogs. Leveraging Reddit-Amazon-EM, we conduct a comprehensive evaluation of state-of-the-art entity matching methods, including rule-based, graph-based, lexical-based, embedding-based, and LLM-based approaches. For reproducible research, we release our manually annotated entity matching gold set and provide the mapping between the two datasets using the best-performing method from our experiments. This serves as a valuable resource for advancing future work on entity matching in recommender systems. Data and Code are accessible at: https://github.com/huang-zihan/Reddit-Amazon-Entity-Matching.

Motivation and Objectives

Introduce Reddit-Amazon-EM, the largest publicly available knowledge-grounded EM dataset linking Reddit movie mentions to Amazon catalog entries.
Provide a rigorous evaluation of diverse EM methods (rule-based, lexical, embedding-based, graph-based, and LLM-based) on cross-dataset linking.
Analyze how EM quality affects downstream LLM-driven conversational recommendation systems.
Enable reproducible research by releasing annotated gold data and evaluation code.

Proposed Method

Construct Reddit-Amazon-EM by linking ~4k Amazon Movie entries to Reddit movie mentions with manual annotation.
Retrieve candidate Amazon entries for each Reddit title using title-based similarity and metadata filtering.
Use a Streamlit interface with supporting metadata and prompts (e.g., GPT-3.5) to manually confirm correct matches.
Evaluate multiple EM baselines: BM25, Faiss, Embedding+Fuzzy, GNEM, ComEM.
Assess performance with Recall@k, Precision@k, F1, and Accuracy, plus computational efficiency.
Provide an annotated gold set and dataset-candidate mappings for reproducibility.
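
The candidate-retrieval and Recall@k steps listed above can be illustrated with a minimal sketch. It is not the authors' pipeline: the summary does not say which similarity function or metadata fields were used, so this example assumes character-level similarity from Python's standard `difflib` and a release-year filter. The catalog, field names, and helper functions (`normalize`, `retrieve_candidates`, `recall_at_k`) are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical toy catalog; the field names are illustrative and are not
# the schema of the Amazon'23 dataset.
AMAZON_CATALOG = [
    {"id": "A1", "title": "The Matrix", "year": 1999},
    {"id": "A2", "title": "The Matrix Reloaded", "year": 2003},
    {"id": "A3", "title": "Inception", "year": 2010},
    {"id": "A4", "title": "Interstellar", "year": 2014},
]


def normalize(title):
    """Lowercase the title and replace punctuation with single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(cleaned.split())


def retrieve_candidates(mention, catalog, k=3, year=None):
    """Return ids of the top-k catalog entries ranked by title similarity.

    If `year` is given, entries from other years are filtered out first
    (an assumed stand-in for the metadata filtering step).
    """
    pool = [e for e in catalog if year is None or e["year"] == year]
    query = normalize(mention)
    scored = [
        (SequenceMatcher(None, query, normalize(e["title"])).ratio(), e["id"])
        for e in pool
    ]
    # Sort by similarity only; ties keep catalog order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry_id for _, entry_id in scored[:k]]


def recall_at_k(gold, catalog, k):
    """Fraction of mentions whose gold id appears among the top-k candidates."""
    hits = sum(
        1
        for mention, gold_id in gold.items()
        if gold_id in retrieve_candidates(mention, catalog, k=k)
    )
    return hits / len(gold)


if __name__ == "__main__":
    # Hypothetical gold set mapping Reddit-style mentions to catalog ids.
    gold = {"the matrix": "A1", "Inception!!": "A3"}
    print(retrieve_candidates("the matrix", AMAZON_CATALOG, k=2))
    print(recall_at_k(gold, AMAZON_CATALOG, k=1))
```

In a setup like this, the retrieved candidates would then be shown to a human annotator, who confirms or rejects each proposed match.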

Experimental Results

Research Questions

RQ1 What is the effectiveness of different entity matching methods on cross-dataset movie linking between Reddit and Amazon?
RQ2 Which EM methods best handle noisy, in-the-wild movie titles and data heterogeneity?
RQ3 How does EM quality translate to performance in LLM-driven conversational recommender systems?
RQ4 What are the trade-offs between precision, recall, and efficiency across EM approaches?

Key Findings

Model	Precision	Recall@1	F1 score	Accuracy
Emb+Fuzzy	86.38 ±0.03	86.99 ±0.04	86.68 ±0.03	92.78 ±0.02
BM25	74.93 ±0.04	82.30 ±0.04	78.43 ±0.03	89.71 ±0.02
Faiss	60.51 ±0.04	89.76 ±0.03	72.28 ±0.04	91.83 ±0.02
BM25 + Faiss	74.54 ±0.04	84.49 ±0.04	79.20 ±0.03	90.75 ±0.02
ComEM	94.50 ±0.04	93.97 ±0.04	94.02 ±0.04	94.70 ±0.04
GNEM	95.82 ±0.02	96.78 ±0.02	96.29 ±0.01	96.74 ±0.01

GNEM (graph-based) achieves the highest performance (F1 96.29%, Accuracy 96.74%).
ComEM (LLM-enhanced) closely follows GNEM with F1 94.02% and Accuracy 94.70%.
Embedding+Fuzzy is stronger than traditional baselines but substantially behind GNEM and ComEM (F1 86.68%).
BM25 and Faiss lag behind with various precision/recall trade-offs (e.g., Faiss shows high recall but low precision).
Traditional methods are fast to initialize but slow at inference; GNEM and ComEM offer better overall efficiency after setup; Emb+Fuzzy balances speed with performance.
On CRS dialogue tasks, GNEM remains the top performer, with robustness to conversational variation, though gains over smaller LLMs diminish in dialogue settings.
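
The precision/recall trade-off noted for Faiss can be made concrete by recomputing F1 as the harmonic mean of the reported precision and recall. The sketch below uses only the numbers from the results table; the helper name `f1_score` is an illustrative choice. Recomputed values may differ slightly from the reported ones, presumably because the reported scores are averaged over runs (an assumption, since the summary does not describe the averaging).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table, in percent.
REPORTED = {
    "Emb+Fuzzy": (86.38, 86.99, 86.68),
    "BM25": (74.93, 82.30, 78.43),
    "Faiss": (60.51, 89.76, 72.28),
    "BM25 + Faiss": (74.54, 84.49, 79.20),
    "ComEM": (94.50, 93.97, 94.02),
    "GNEM": (95.82, 96.78, 96.29),
}

if __name__ == "__main__":
    for model, (precision, recall, reported_f1) in REPORTED.items():
        recomputed = f1_score(precision, recall)
        print(f"{model}: recomputed F1 {recomputed:.2f}, reported F1 {reported_f1:.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, Faiss's high recall cannot compensate for its low precision, which is why its F1 is the lowest in the table despite its strong recall.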

This review was generated by AI and reviewed by human editors.