QUICK REVIEW

[Paper Review] Entity Matching using Large Language Models

Ralph Peeters, Aaron Steiner | arXiv (Cornell University) | Oct 17, 2023

Topic Modeling | Cited by 8

ONE-SENTENCE SUMMARY

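This paper evaluates large language models (LLMs) on entity matching, comparing zero-shot and few-shot prompting across hosted and open-source LLMs and against PLM baselines; it stresses that prompt design acts as a hyperparameter and shows that, without task-specific training, LLMs can match or exceed PLMs and are highly robust to unseen entities.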

ABSTRACT

Entity matching is the task of deciding whether two entity descriptions refer to the same real-world entity. Entity matching is a central step in most data integration pipelines. Many state-of-the-art entity matching methods rely on pre-trained language models (PLMs) such as BERT or RoBERTa. Two major drawbacks of these models for entity matching are that (i) the models require significant amounts of task-specific training data and (ii) the fine-tuned models are not robust concerning out-of-distribution entities. This paper investigates using generative large language models (LLMs) as a less task-specific training data-dependent and more robust alternative to PLM-based matchers. The study covers hosted and open-source LLMs which can be run locally. We evaluate these models in a zero-shot scenario and a scenario where task-specific training data is available. We compare different prompt designs and the prompt sensitivity of the models. We show that there is no single best prompt but that the prompt needs to be tuned for each model/dataset combination. We further investigate (i) the selection of in-context demonstrations, (ii) the generation of matching rules, as well as (iii) fine-tuning LLMs using the same pool of training data. Our experiments show that the best LLMs require no or only a few training examples to perform comparably to PLMs that were fine-tuned using thousands of examples. LLM-based matchers further exhibit higher robustness to unseen entities. We show that GPT4 can generate structured explanations for matching decisions and can automatically identify potential causes of matching errors by analyzing explanations of wrong decisions. We demonstrate that the model can generate meaningful textual descriptions of the identified error classes, which can help data engineers to improve entity matching pipelines.

RESEARCH MOTIVATION AND GOALS

Motivate the use of LLMs to address two limitations of PLMs in entity matching: their need for large amounts of task-specific training data and their weak robustness to unseen entities.
Evaluate a wide range of prompt designs and in-context learning strategies across multiple benchmark datasets.
Compare hosted versus open-source LLMs for privacy-sensitive use cases.
Investigate fine-tuning LLMs for improved performance while preserving generalization.

PROPOSED METHOD

Evaluate three hosted LLMs (GPT-3.5-turbo-0301, GPT-3.5-turbo-0613, GPT-4) and three open-source LLMs (SOLAR, Beluga2, Mixtral) on six entity matching (EM) benchmarks.
Compare against two strong PLM baselines: RoBERTa-base and Ditto (fine-tuned RoBERTa).
Serialize entity pairs as concatenated attribute strings and decide match by parsing for the word 'yes' in LLM outputs.
Explore a range of zero-shot prompt designs (domain/general, simple/complex, force/free) and analyze prompt sensitivity.
Conduct in-context learning with demonstrations that are hand-picked, chosen at random, or chosen for being related to the pair under comparison, and also test learned and hand-written matching rules.
Assess model robustness to unseen entities by transferring fine-tuned PLMs to unseen data.
Experiment with adding task-specific data to prompts (demonstrations), learning rules, and fine-tuning LLMs.
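
The serialization, prompt-design, and answer-parsing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the prompt wording, the `Entity 1` / `Entity 2` layout, and the helper names `serialize`, `build_prompt`, and `parse_match` are all assumptions made for this example. Only the three design axes (domain/general, simple/complex, force/free) and the check for the word 'yes' come from the review.

```python
import re
from itertools import product

# Hypothetical wording for each design axis; the paper's exact prompts may differ.
QUESTIONS = {
    ("domain", "simple"): "Do the two product descriptions match?",
    ("domain", "complex"): "Do the two product descriptions refer to the same real-world product?",
    ("general", "simple"): "Do the two entity descriptions match?",
    ("general", "complex"): "Do the two entity descriptions refer to the same real-world entity?",
}
FORCE_SUFFIX = " Answer with 'Yes' if they do and 'No' if they do not."


def serialize(record: dict) -> str:
    """Concatenate the non-empty attribute values of one record into a string."""
    return " ".join(str(v) for v in record.values() if v is not None and v != "")


def build_prompt(left: dict, right: dict, scope="general", wording="complex", answer="force") -> str:
    """Assemble one zero-shot prompt for a pair of entity descriptions."""
    question = QUESTIONS[(scope, wording)]
    if answer == "force":  # "free" leaves the answer format unconstrained
        question += FORCE_SUFFIX
    return f"{question}\nEntity 1: '{serialize(left)}'\nEntity 2: '{serialize(right)}'"


def parse_match(llm_output: str) -> bool:
    """Treat the pair as a match if the reply contains the standalone word 'yes'."""
    return "yes" in re.findall(r"[a-z]+", llm_output.lower())


# All eight combinations of the three design axes.
PROMPT_DESIGNS = list(product(("domain", "general"), ("simple", "complex"), ("force", "free")))
```

Matching on whole words rather than substrings is a choice made in this sketch so that a reply such as "Yesterday's listing differs" is not counted as a match; the review does not say how the original parser handles such cases.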

EXPERIMENTAL RESULTS

Research Questions

RQ1: Can large language models perform entity matching without task-specific training data?
RQ2: How do zero-shot prompt designs affect EM performance across models and domains?
RQ3: What is the role of in-context demonstrations and demonstration selection strategies in EM with LLMs?
RQ4: Do locally deployed open-source LLMs achieve performance comparable to hosted models on EM tasks?
RQ5: Can fine-tuning or rule-based guidance further improve EM performance without sacrificing generalization?
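
RQ2 amounts to comparing prompt designs by their matching quality for a given model and dataset. One way to sketch such a comparison, assuming the match predictions for each prompt design have already been collected on a labeled validation set, is to score each design by F1 and keep the best one. The names `f1_score` and `pick_best_prompt` and the dictionary layout are hypothetical, and the review does not describe the authors' actual evaluation code.

```python
def f1_score(y_true, y_pred):
    """F1 of the positive (match) class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_best_prompt(predictions_by_prompt, y_true):
    """Score every prompt design on the same labels and return the top one.

    predictions_by_prompt maps a prompt-design name to its list of boolean
    match predictions for one model/dataset combination (assumed layout).
    """
    scores = {name: f1_score(y_true, preds) for name, preds in predictions_by_prompt.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy usage with made-up predictions for two prompt designs.
labels = [True, True, False, False]
candidates = {
    "general-simple-free": [True, False, False, False],
    "domain-complex-force": [True, True, True, False],
}
best_name, all_scores = pick_best_prompt(candidates, labels)
```

The selection would have to be repeated for every model and dataset, since a design that wins in one setting may not win in another.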

Key Findings

GPT-4 achieves the strongest zero-shot F1 across datasets, reaching 89%+ on several datasets without task-specific training.
There is no single best prompt; prompt effectiveness depends on the model and dataset, so prompts should be tuned like hyperparameters.
Open-source LLMs (SOLAR, Beluga2, Mixtral) can approach or match GPT-3.5 results with proper prompting; GPT-4 remains superior in zero-shot.
Zero-shot GPT-4 outperforms fine-tuned PLMs on 3 of 6 datasets and is competitive on others, indicating LLMs can reduce or replace task-specific training data needs.
Fine-tuning an LLM for EM significantly improves performance while preserving cross-dataset generalization; transfer of fine-tuned PLMs often fails on unseen data.
In-context demonstrations improve performance for most models and datasets, with gains varying by dataset and model; related demonstrations often help GPT-4, while random or hand-picked demonstrations help open-source LLMs.
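
The difference between random and related demonstration selection mentioned in the last finding can be illustrated with a small sketch. It assumes that "related" means "most similar to the pair being matched" and uses token-level Jaccard similarity as the similarity measure; both are assumptions of this example, not details confirmed by the review. The pool format (a list of dicts with `text` and `label` keys) and the function names are also hypothetical.

```python
import random


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two serialized descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def select_random(pool, k, seed=0):
    """Pick k demonstrations uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


def select_related(query_text, pool, k):
    """Pick the k demonstrations whose text is most similar to the query pair."""
    ranked = sorted(pool, key=lambda d: jaccard(query_text, d["text"]), reverse=True)
    return ranked[:k]


# Toy pool of labeled training pairs, each already serialized into one string.
pool = [
    {"text": "dymo d1 tape 12mm black", "label": True},
    {"text": "canon pixma ink cartridge", "label": False},
    {"text": "dymo label tape 9mm", "label": False},
]
related = select_related("dymo d1 tape 12mm", pool, k=2)
```

Hand-picked demonstrations need no selection code: a fixed set chosen by a person is reused for every query.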

This review was generated by AI and reviewed by human editors.