QUICK REVIEW

[Paper Review] LLM-based relevance assessment still can't replace human relevance assessment

Charles L. A. Clarke, Laura Dietz|arXiv (Cornell University)|Dec 22, 2024

Semantic Web and Ontologies|Cited by 9

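One-Line Summary

This paper critiques the claim that LLM-based relevance judgments can fully replace human judgments in TREC-style evaluation, and lays out practical and theoretical limitations, including vulnerability to exploitation and bias.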

ABSTRACT

The use of large language models (LLMs) for relevance assessment in information retrieval has gained significant attention, with recent studies suggesting that LLM-based judgments provide comparable evaluations to human judgments. Notably, based on TREC 2024 data, Upadhyay et al. make a bold claim that LLM-based relevance assessments, such as those generated by the Umbrela system, can fully replace traditional human relevance assessments in TREC-style evaluations. This paper critically examines this claim, highlighting practical and theoretical limitations that undermine the validity of this conclusion. First, we question whether the evidence provided by Upadhyay et al. genuinely supports their claim, particularly when the test collection is intended to serve as a benchmark for future research innovations. Second, we submit a system deliberately crafted to exploit automatic evaluation metrics, demonstrating that it can achieve artificially inflated scores without truly improving retrieval quality. Third, we simulate the consequences of circularity by analyzing Kendall's tau correlations under the hypothetical scenario in which all systems adopt Umbrela as a final-stage re-ranker, illustrating how reliance on LLM-based assessments can distort system rankings. Theoretical challenges, including the inherent narcissism of LLMs, the risk of overfitting to LLM-based metrics, and the potential degradation of future LLM performance, must be addressed before LLM-based relevance assessments can be considered a viable replacement for human judgments.

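Motivation and Objectives

Examine the evidence that LLM-based judgments can replace human relevance assessments in TREC-style tasks.
Highlight how automatic evaluation can be manipulated, and explain why the resulting scores can be misleading.
Discuss the theoretical challenges and biases that undermine the reliability of LLM-based relevance assessment.
Argue for the continued use of human judgments as the gold standard for evaluating the usefulness of information retrieval.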

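Proposed Approach

Examine and critique the findings of Upadhyay et al. (2024) on LLM-based relevance assessment.
Demonstrate empirically how automatic evaluation can be gamed through pool construction and re-ranking based on LLM judgments.
Show how LLM-based relevance assessment functions as a re-ranker rather than as a gold standard.
Discuss biases, including LLM narcissism and vulnerability to prompt attacks.
Consider Goodhart's law and the possibility that future LLM performance degrades in automated pipelines.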
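The circularity concern can be illustrated with a toy example. The sketch below is not the authors' experiment and does not use Umbrela; the document IDs, the `human` and `llm` label dictionaries, and the `precision_at_k` helper are all hypothetical and chosen only to show the mechanism: when a system re-ranks its output with the same LLM judge that later scores it, the LLM-based score can rise while the human-based score falls.

```python
def precision_at_k(ranking, labels, k):
    """Fraction of the top-k documents labelled relevant (1) in `labels`."""
    return sum(labels[doc] for doc in ranking[:k]) / k


# Hypothetical binary relevance labels for six documents.
human = {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 1, "d6": 0}
llm = {"d1": 1, "d2": 1, "d3": 0, "d4": 1, "d5": 1, "d6": 0}

# Hypothetical baseline ranking produced by some retrieval system.
baseline = ["d1", "d3", "d5", "d2", "d4", "d6"]

# Final-stage re-ranking by the LLM judge's own labels.
# sorted() is stable, so ties keep their baseline order.
reranked = sorted(baseline, key=lambda doc: -llm[doc])

for name, ranking in [("baseline", baseline), ("reranked", reranked)]:
    print(
        name,
        "P@3 (human labels):", round(precision_at_k(ranking, human, 3), 3),
        "P@3 (LLM labels):", round(precision_at_k(ranking, llm, 3), 3),
    )
```

In this toy data the re-ranked run reaches a perfect P@3 under the LLM labels while its P@3 under the human labels drops from 1.0 to about 0.667, so the automatic metric rewards agreement with the judge rather than usefulness to a human reader.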

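Experimental Results

Research Questions

RQ1: Do LLM-based relevance judgments agree with human judgments closely enough to serve as a reliable replacement when evaluating top-performing retrieval systems?
RQ2: Can an automated LLM-based evaluation process be trusted to measure progress on IR benchmarks, or is it easily manipulated?
RQ3: Do theoretical and practical biases and constraints prevent LLM-based relevance assessment from serving as a gold standard?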

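Key Findings

Automatic LLM judgments diverge from manual judgments for the top-performing runs, which undermines their usefulness for identifying improvements.
Automatic evaluation can be deceived by constructing a pool from LLM judgments, which produces artificially high scores for some runs.
LLM-based relevance assessment is closer to re-ranking than to true relevance judgment, and it lacks grounding in usefulness to humans.
The discussion of notable biases and vulnerabilities (e.g., LLM narcissism and vulnerability to prompt attacks) challenges the validity of LLM-based assessment as a substitute for human judgment.
As end-to-end evaluation pipelines become automated, there is concern that the correlation between manual and automatic judgments will degrade (Goodhart's law).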
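Agreement between two sets of system scores is commonly summarized with Kendall's tau, the statistic named in the abstract. The following is a minimal sketch of tau-a (no tie correction) in plain Python; the `kendall_tau` helper and the two score lists are hypothetical illustrations, not data from the paper. It shows how a high overall correlation can coexist with disagreement about which system is best.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists for the same systems.

    Counts concordant minus discordant pairs and divides by the total
    number of pairs. Tied pairs count as neither (no tie correction).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    if len(scores_a) < 2:
        raise ValueError("need at least two systems")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for five systems, listed in the same system order.
human_scores = [0.50, 0.45, 0.40, 0.30, 0.20]
llm_scores = [0.48, 0.52, 0.41, 0.33, 0.25]

tau_all = kendall_tau(human_scores, llm_scores)
tau_top2 = kendall_tau(human_scores[:2], llm_scores[:2])
print("tau over all systems:", tau_all)
print("tau over the top two systems:", tau_top2)
```

With these made-up numbers, tau over all five systems is 0.8, yet the two judgment sets disagree on the ordering of the top two systems (tau of -1.0 for that pair), which is exactly the region where a benchmark needs to separate new improvements.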

This review was generated by AI and checked by a human editor.