QUICK REVIEW

[Paper Review] LLM-based relevance assessment still can't replace human relevance assessment

Charles L. A. Clarke, Laura Dietz|arXiv (Cornell University)|Dec 22, 2024

Semantic Web and Ontologies | Cited by 9

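One-Sentence Summary

This paper challenges the claim that LLM-based relevance judgments can fully replace human judgments in TREC-style evaluations, and it demonstrates practical and theoretical limitations, including vulnerability to subversion and bias.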

ABSTRACT

The use of large language models (LLMs) for relevance assessment in information retrieval has gained significant attention, with recent studies suggesting that LLM-based judgments provide comparable evaluations to human judgments. Notably, based on TREC 2024 data, Upadhyay et al. make a bold claim that LLM-based relevance assessments, such as those generated by the Umbrela system, can fully replace traditional human relevance assessments in TREC-style evaluations. This paper critically examines this claim, highlighting practical and theoretical limitations that undermine the validity of this conclusion. First, we question whether the evidence provided by Upadhyay et al. genuinely supports their claim, particularly when the test collection is intended to serve as a benchmark for future research innovations. Second, we submit a system deliberately crafted to exploit automatic evaluation metrics, demonstrating that it can achieve artificially inflated scores without truly improving retrieval quality. Third, we simulate the consequences of circularity by analyzing Kendall's tau correlations under the hypothetical scenario in which all systems adopt Umbrela as a final-stage re-ranker, illustrating how reliance on LLM-based assessments can distort system rankings. Theoretical challenges, including the inherent narcissism of LLMs, the risk of overfitting to LLM-based metrics, and the potential degradation of future LLM performance, must be addressed before LLM-based relevance assessments can be considered a viable replacement for human judgments.
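
To make the Kendall's tau comparison mentioned in the abstract concrete, here is a minimal Python sketch of how agreement between two leaderboards can be measured. The function name `kendall_tau` and all scores are illustrative assumptions, not code or data from the paper. The sketch computes the simple tau-a variant with no tie correction.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same systems.

    No tie correction is applied; tied pairs count as neither
    concordant nor discordant.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    if len(scores_a) < 2:
        raise ValueError("need at least two systems to compare")

    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        # The sign of the product tells us whether both metrics
        # order systems i and j the same way.
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for five systems (illustrative, not from the paper).
human_scores = [0.62, 0.58, 0.55, 0.41, 0.30]  # under human judgments
llm_scores = [0.57, 0.60, 0.59, 0.45, 0.33]    # under LLM judgments

tau = kendall_tau(human_scores, llm_scores)
print(f"Kendall's tau: {tau:.2f}")
```

In this toy data the two leaderboards agree on 8 of 10 system pairs, so tau is 0.6, yet both disagreements involve the system that human judgments rank first. A single correlation over all systems can therefore look acceptable while still misordering the top of the leaderboard.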

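Research Motivation and Objectives

Question the evidence that LLM-based judgments can replace human relevance assessments in TREC-style tasks.
Highlight how automatic evaluation can be manipulated and why such results may be misleading.
Discuss the theoretical challenges and biases that undermine the reliability of LLM-based relevance assessment.
Advocate for continuing to use human judgments as the gold standard for evaluating retrieval usefulness.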

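Proposed Method

Review and critique the findings of Upadhyay et al. (2024) on LLM-based relevance assessment.
Empirically demonstrate how automatic evaluation can be subverted by pooling documents and re-ranking them with LLM judgments.
Show that LLM-based relevance assessment behaves more like a re-ranking method than a gold standard.
Discuss biases such as LLM narcissism and susceptibility to prompt attacks.
Consider Goodhart's law and the potential degradation of LLM performance in future automated pipelines.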
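
The circularity concern, where a system re-ranks with the same LLM judge that later scores it, can be illustrated with a small Python sketch. Everything here is a hand-built assumption for illustration: the document IDs, both label sets, and the helper names `precision_at_k` and `rerank_by_judge` are hypothetical and do not come from the paper or from Umbrela.

```python
def precision_at_k(ranking, labels, k):
    """Fraction of the top-k documents labeled relevant (label > 0)."""
    top = ranking[:k]
    return sum(1 for doc in top if labels[doc] > 0) / k


def rerank_by_judge(ranking, judge_labels):
    """Reorder a ranking by descending judge label.

    Python's sort is stable, so documents with equal labels keep
    their original relative order.
    """
    return sorted(ranking, key=lambda doc: -judge_labels[doc])


# Hypothetical binary labels for six documents (illustrative only).
human_labels = {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 1, "d6": 0}
llm_labels = {"d1": 1, "d2": 1, "d3": 0, "d4": 1, "d5": 1, "d6": 0}

original = ["d1", "d3", "d5", "d2", "d4", "d6"]
reranked = rerank_by_judge(original, llm_labels)

for name, run in [("original", original), ("reranked", reranked)]:
    llm_p3 = precision_at_k(run, llm_labels, 3)
    human_p3 = precision_at_k(run, human_labels, 3)
    print(f"{name}: P@3 under LLM labels = {llm_p3:.2f}, "
          f"under human labels = {human_p3:.2f}")
```

With these constructed labels, re-ranking raises precision at 3 under the LLM labels from 2/3 to 1.0 while precision at 3 under the human labels falls from 1.0 to 2/3. The size and direction of the human-judged change are artifacts of the hand-picked labels. The sketch only shows the mechanism: optimizing directly against the judge guarantees a high judge-based score regardless of what human assessors would say.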

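Experimental Results

Research Questions

RQ1: Do LLM-based relevance judgments align with human judgments reliably enough to substitute for them when evaluating top-performing retrieval systems?
RQ2: Can an automated LLM-based evaluation process be trusted to measure progress on information retrieval benchmarks, or is it easily manipulated?
RQ3: Which theoretical and practical biases or limitations prevent LLM-based relevance assessment from serving as a gold standard?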

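Key Findings

Automated LLM judgments can disagree with human judgments on top-ranked runs, which weakens their usefulness for distinguishing improvements.
Automatic evaluation can be subverted by building a run from a pool of documents scored with LLM judgments, which gives some runs artificially high scores.
LLM-based relevance assessment acts more like re-ranking than true relevance judgment, and it lacks grounding in usefulness to humans.
Significant biases and vulnerabilities (for example, LLM narcissism and prompt-based deception) challenge the validity of LLM-based assessment as a replacement for human judgments.
There is concern that the correlation between human and automatic judgments will degrade as more steps of the end-to-end evaluation pipeline are automated (Goodhart's law).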

This review was generated by AI and checked by a human editor.