QUICK REVIEW

[Paper Review] LLM-based relevance assessment still can't replace human relevance assessment

Charles L. A. Clarke, Laura Dietz | arXiv (Cornell University) | 2024-12-22

Semantic Web and Ontologies | Citations: 9

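One-Line Summary

This paper challenges the claim that LLM-based relevance judgments can fully replace human judgments in TREC-style evaluation. It presents practical and theoretical limitations and shows that such judgments are vulnerable to manipulation and subject to bias.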

ABSTRACT

The use of large language models (LLMs) for relevance assessment in information retrieval has gained significant attention, with recent studies suggesting that LLM-based judgments provide comparable evaluations to human judgments. Notably, based on TREC 2024 data, Upadhyay et al. make a bold claim that LLM-based relevance assessments, such as those generated by the Umbrela system, can fully replace traditional human relevance assessments in TREC-style evaluations. This paper critically examines this claim, highlighting practical and theoretical limitations that undermine the validity of this conclusion. First, we question whether the evidence provided by Upadhyay et al. genuinely supports their claim, particularly when the test collection is intended to serve as a benchmark for future research innovations. Second, we submit a system deliberately crafted to exploit automatic evaluation metrics, demonstrating that it can achieve artificially inflated scores without truly improving retrieval quality. Third, we simulate the consequences of circularity by analyzing Kendall's tau correlations under the hypothetical scenario in which all systems adopt Umbrela as a final-stage re-ranker, illustrating how reliance on LLM-based assessments can distort system rankings. Theoretical challenges - including the inherent narcissism of LLMs, the risk of overfitting to LLM-based metrics, and the potential degradation of future LLM performance - must be addressed before LLM-based relevance assessments can be considered a viable replacement for human judgments.

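Motivation and Objectives

Questions the evidence that LLM-based judgments can replace human relevance assessment in TREC-style tasks.
Highlights how automatic evaluation can be manipulated and why the resulting scores can be misleading.
Discusses theoretical problems and biases that undermine the reliability of LLM-based relevance assessment.
Advocates the continued use of human judgments as the gold standard for evaluating retrieval usefulness.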

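Proposed Approach

Reviews and critiques the findings of Upadhyay et al. (2024) on LLM-based relevance assessment.
Demonstrates empirically how pooling and re-ranking by LLM judgments can subvert automatic evaluation.
Shows that LLM-based relevance assessment behaves as a re-ranking method rather than as a gold standard.
Discusses biases, including LLM narcissism and vulnerability to prompt attacks.
Considers Goodhart's law and the possible degradation of future LLM performance in automated pipelines.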
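
The re-ranking argument in this list can be illustrated with a toy sketch. It assumes hypothetical per-document grades (`llm_grade`, `human_grade`), a made-up run, and a simple precision-at-k metric; none of these names or numbers come from the paper. The point is only that a run re-ranked by the same LLM grades that later evaluate it gains automatic score, while its score under human grades need not improve.

```python
from typing import Dict, List

# Hypothetical grades for one query: doc id -> relevance (0 or 1).
# All values are invented for illustration; they are not from the paper.
llm_grade: Dict[str, int] = {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 1, "d6": 0}
human_grade: Dict[str, int] = {"d1": 1, "d2": 1, "d3": 0, "d4": 1, "d5": 0, "d6": 0}


def precision_at_k(ranking: List[str], grades: Dict[str, int], k: int) -> float:
    """Fraction of the top-k documents judged relevant under `grades`."""
    return sum(grades[d] for d in ranking[:k]) / k


def rerank_by_judge(ranking: List[str], grades: Dict[str, int]) -> List[str]:
    """Stable re-rank: documents the judge grades higher move to the top."""
    return sorted(ranking, key=lambda d: -grades[d])


original_run = ["d2", "d1", "d4", "d3", "d6", "d5"]
gamed_run = rerank_by_judge(original_run, llm_grade)

for name, run in [("original", original_run), ("gamed", gamed_run)]:
    print(
        name,
        "P@3 (LLM judge) =", round(precision_at_k(run, llm_grade, 3), 3),
        "P@3 (human) =", round(precision_at_k(run, human_grade, 3), 3),
    )
# original P@3 (LLM judge) = 0.333 P@3 (human) = 1.0
# gamed P@3 (LLM judge) = 1.0 P@3 (human) = 0.333
```

The toy data is deliberately extreme. In practice the two sets of grades overlap far more, but the direction of the effect is the same: optimizing against the judge improves the judged score by construction.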

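Experimental Results

Research Questions

RQ1: For top-performing retrieval systems, do LLM-based relevance judgments agree with human judgments reliably enough to replace them?
RQ2: Is an automatic LLM-based evaluation process reliable for measuring progress on IR benchmarks, or can it be easily manipulated?
RQ3: What theoretical and practical biases or limitations prevent LLM-based relevance assessment from serving as a gold standard?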

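Key Findings

Automatic LLM judgments of the top runs can diverge from manual judgments, which weakens the ability to distinguish between improvements.
Automatic evaluation can be subverted by constructing a pool assessed by an LLM, which yields artificially high scores for some runs.
LLM-based relevance assessment is closer to re-ranking than to genuine relevance judgment, and it lacks grounding in usefulness to humans.
Key biases and vulnerabilities (e.g., LLM narcissism, prompt-based deception) challenge the validity of replacing human judgments with LLM-based evaluation.
As end-to-end evaluation pipelines automate more stages, the correlation between manual and automatic judgments may deteriorate (Goodhart's law).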
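
The first finding concerns agreement among the top runs, which an overall rank correlation can hide. The sketch below is a minimal hand-written Kendall's tau-a (no tie correction), applied to hypothetical system scores; the system names and numbers in `manual` and `auto` are assumptions for illustration, not results from the paper. It compares the correlation over all systems with the correlation over only the top three systems under manual judgments.

```python
from itertools import combinations
from typing import Dict, List


def kendall_tau(a: Dict[str, float], b: Dict[str, float], systems: List[str]) -> float:
    """Kendall's tau-a over `systems`: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (a[s] - a[t]) * (b[s] - b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical mean scores per system under each kind of judgment.
manual = {"A": 0.60, "B": 0.55, "C": 0.50, "D": 0.30, "E": 0.20, "F": 0.10}
auto = {"A": 0.50, "B": 0.58, "C": 0.62, "D": 0.31, "E": 0.22, "F": 0.12}

all_systems = list(manual)
top_systems = sorted(manual, key=manual.get, reverse=True)[:3]

tau_all = kendall_tau(manual, auto, all_systems)
tau_top = kendall_tau(manual, auto, top_systems)

print("tau over all systems:", round(tau_all, 2))    # 0.6
print("tau over top-3 systems:", round(tau_top, 2))  # -1.0
```

With these invented scores the two judgment types agree on the weak systems and fully disagree on the order of the strongest three, so a moderate overall tau coexists with a reversed ranking at the top, where a benchmark most needs to discriminate.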

This review was generated by AI and reviewed by a human editor.