QUICK REVIEW

[Paper Review] Entity Matching using Large Language Models

Ralph Peeters, Aaron Steiner | arXiv (Cornell University) | 2023-10-17

Topic Modeling | Citations: 8

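One-Line Summary

The paper evaluates large language models (LLMs) for entity matching, comparing zero-shot and few-shot prompting across hosted and open-source LLMs and contrasting them with PLM baselines. It emphasizes that prompt design acts as a hyperparameter, that LLMs can match or outperform PLMs without task-specific training, and that LLMs are strongly robust to unseen entities.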

ABSTRACT

Entity matching is the task of deciding whether two entity descriptions refer to the same real-world entity. Entity matching is a central step in most data integration pipelines. Many state-of-the-art entity matching methods rely on pre-trained language models (PLMs) such as BERT or RoBERTa. Two major drawbacks of these models for entity matching are that (i) the models require significant amounts of task-specific training data and (ii) the fine-tuned models are not robust concerning out-of-distribution entities. This paper investigates using generative large language models (LLMs) as a less task-specific training data-dependent and more robust alternative to PLM-based matchers. The study covers hosted and open-source LLMs which can be run locally. We evaluate these models in a zero-shot scenario and a scenario where task-specific training data is available. We compare different prompt designs and the prompt sensitivity of the models. We show that there is no single best prompt but that the prompt needs to be tuned for each model/dataset combination. We further investigate (i) the selection of in-context demonstrations, (ii) the generation of matching rules, as well as (iii) fine-tuning LLMs using the same pool of training data. Our experiments show that the best LLMs require no or only a few training examples to perform comparably to PLMs that were fine-tuned using thousands of examples. LLM-based matchers further exhibit higher robustness to unseen entities. We show that GPT4 can generate structured explanations for matching decisions and can automatically identify potential causes of matching errors by analyzing explanations of wrong decisions. We demonstrate that the model can generate meaningful textual descriptions of the identified error classes, which can help data engineers to improve entity matching pipelines.

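Motivation and Objectives

Motivates the use of LLMs to address the limitations of PLMs in entity matching, namely data efficiency and robustness to unseen entities.
Evaluates a range of prompt designs and in-context learning strategies across multiple benchmark datasets.
Compares hosted and open-source LLMs, the latter for privacy-sensitive use cases.
Investigates fine-tuning LLMs to improve performance while preserving generalization.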

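Proposed Method

Evaluates three hosted LLMs (GPT-3.5-turbo-0301, GPT-3.5-turbo-0613, GPT-4) and three open-source LLMs (SOLAR, Beluga2, Mixtral) on six entity matching (EM) benchmarks.
Compares against RoBERTa-base and Ditto (fine-tuned RoBERTa) as strong baselines.
Serializes each entity pair as concatenated attribute strings and decides a match by parsing the LLM output for the word 'yes'.
Explores a range of zero-shot prompt designs (domain-specific/general, simple/complex, forced/free answer format) and analyzes prompt sensitivity.
Performs in-context learning with hand-picked, random, or related demonstrations, the last chosen by a similarity heuristic, and also experiments with learned and hand-written matching rules.
Evaluates robustness to unseen data, comparing LLM-based matchers against fine-tuned PLMs.
Further experiments with three uses of task-specific data: demonstrations in the prompt, learned rules, and LLM fine-tuning.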
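
The serialization, demonstration selection, and answer parsing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the prompt wording in `QUESTION`, the helper names, and the use of token-level Jaccard similarity to pick "related" demonstrations are all assumptions made for the example. The call to an actual LLM is left out, so `parse_answer` simply takes the response text.

```python
import re

# Hypothetical prompt wording; the paper compares many prompt designs.
QUESTION = ("Do the two entity descriptions refer to the same real-world "
            "entity? Answer with 'Yes' or 'No'.")

def serialize(entity):
    """Concatenate the non-empty attribute values of one entity into a string."""
    return " ".join(str(v) for v in entity.values() if v not in (None, ""))

def jaccard(a, b):
    """Token-level Jaccard similarity between two strings (assumed heuristic)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def select_related(left, right, pool, k=2):
    """Pick the k labeled pairs from the pool most similar to the query pair."""
    query = serialize(left) + " " + serialize(right)

    def score(demo):
        d_left, d_right, _ = demo
        return jaccard(query, serialize(d_left) + " " + serialize(d_right))

    return sorted(pool, key=score, reverse=True)[:k]

def format_pair(left, right):
    """Render one entity pair as two labeled lines."""
    return f"Entity 1: {serialize(left)}\nEntity 2: {serialize(right)}"

def build_prompt(left, right, demos=()):
    """Zero-shot prompt when demos is empty, few-shot prompt otherwise."""
    parts = []
    for d_left, d_right, is_match in demos:
        answer = "Yes" if is_match else "No"
        parts.append(format_pair(d_left, d_right) + f"\nAnswer: {answer}")
    parts.append(format_pair(left, right) + "\nAnswer:")
    return QUESTION + "\n\n" + "\n\n".join(parts)

def parse_answer(response):
    """Treat the pair as a match if the response contains the word 'yes'."""
    return re.search(r"\byes\b", response.lower()) is not None
```

Matching on the whole word rather than the substring keeps a response such as "Yesterday's listing differs" from being read as a match. A free-form answer like "No, but yes for the brand" would still be misparsed, which is one reason the choice between forced and free answer formats matters.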

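Experimental Results

Research Questions

RQ1: Can large language models perform entity matching without task-specific training data?
RQ2: How does zero-shot prompt design affect EM performance across models and domains?
RQ3: What role do in-context demonstrations and demonstration selection strategies play in LLM-based EM?
RQ4: Do locally deployed open-source LLMs deliver EM performance comparable to hosted models?
RQ5: Can fine-tuning or rule-based guidance further improve EM performance without harming generalization?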

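Key Findings

GPT-4 achieves the strongest zero-shot F1 across datasets, reaching 89% or higher on several of them without task-specific training.
There is no single best prompt: prompt effectiveness varies by model and dataset, so the prompt can be treated like a hyperparameter.
With appropriate prompting, the open-source LLMs (SOLAR, Beluga2, Mixtral) approach or match GPT-3.5, but GPT-4 remains ahead in the zero-shot setting.
Zero-shot GPT-4 outperforms fine-tuned PLMs on 3 of the 6 datasets and stays competitive on the rest, suggesting that LLMs can reduce or replace the need for task-specific training data.
Fine-tuning LLMs for EM significantly improves performance while preserving cross-dataset generalization; fine-tuned PLMs, by contrast, often fail to transfer to unseen data.
In-context demonstrations improve performance for most models and datasets, though the gain varies by dataset and model; related demonstrations often help GPT-4, while random or hand-picked demonstrations help the open-source LLMs.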
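
To make the "prompt as hyperparameter" point concrete, the sketch below scores several prompt designs by F1 on a labeled set and keeps the best one for a given model/dataset combination. It is only an illustration under assumptions: the function names are hypothetical, the predictions are assumed to have been collected already (one list of booleans per prompt design), and selecting the prompt on a held-out validation split is a common practice rather than something this review states.

```python
def f1_score(predictions, labels):
    """F1 of the positive (match) class for boolean predictions and labels."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pick_best_prompt(predictions_by_prompt, labels):
    """Return (prompt_name, f1) for the prompt design with the highest F1.

    predictions_by_prompt maps a prompt name to the list of match decisions
    that one model produced with that prompt on one dataset.
    """
    scores = {name: f1_score(preds, labels)
              for name, preds in predictions_by_prompt.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy example with made-up decisions for two hypothetical prompt designs.
labels = [True, True, False, False]
runs = {
    "general-simple-forced": [True, False, False, False],
    "domain-complex-free": [True, True, True, False],
}
best_name, best_f1 = pick_best_prompt(runs, labels)
```

Because the best prompt differs per model and dataset, this selection would be repeated for each combination instead of reusing one winner everywhere.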


This review was generated by AI and reviewed by a human editor.