QUICK REVIEW

[Paper Review] A Comparison of Methods for Evaluating Generative IR

Negar Arabzadeh, Charles L. A. Clarke|arXiv (Cornell University)|Apr 5, 2024

Smart Systems and Machine Learning | Citations: 23

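One-line summary

This paper surveys and validates several methods for evaluating Gen-IR systems that generate novel responses, comparing binary relevance, graded relevance, subtopics, pairwise preferences, and embeddings against human assessments. The methods are validated on TREC Deep Learning Track tasks, and each is analyzed for its ability to operate autonomously and to support auditing.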

ABSTRACT

Information retrieval systems increasingly incorporate generative components. For example, in a retrieval augmented generation (RAG) system, a retrieval component might provide a source of ground truth, while a generative component summarizes and augments its responses. In other systems, a large language model (LLM) might directly generate responses without consulting a retrieval component. While there are multiple definitions of generative information retrieval (Gen-IR) systems, in this paper we focus on those systems where the system's response is not drawn from a fixed collection of documents or passages. The response to a query may be entirely new text. Since traditional IR evaluation methods break down under this model, we explore various methods that extend traditional offline evaluation approaches to the Gen-IR context. Offline IR evaluation traditionally employs paid human assessors, but increasingly LLMs are replacing human assessment, demonstrating capabilities similar or superior to crowdsourced labels. Given that Gen-IR systems do not generate responses from a fixed set, we assume that methods for Gen-IR evaluation must largely depend on LLM-generated labels. Along with methods based on binary and graded relevance, we explore methods based on explicit subtopics, pairwise preferences, and embeddings. We first validate these methods against human assessments on several TREC Deep Learning Track tasks; we then apply these methods to evaluate the output of several purely generative systems. For each method we consider both its ability to act autonomously, without the need for human labels or other input, and its ability to support human auditing. To trust these methods, we must be assured that their results align with human assessments. In order to do so, evaluation criteria must be transparent, so that outcomes can be audited by human assessors.

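Motivation and objectives

Motivate Gen-IR evaluation beyond a fixed corpus, where a response may be entirely new text.
Compare multiple Gen-IR evaluation methods in terms of agreement with human assessments and auditability.
Validate the methods using TREC Deep Learning Track data and recent LLMs.
Assess each method's ability to run autonomously and its suitability for human auditing.
Provide open access to code, prompts, and data to support reproducibility and auditing.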

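Proposed method

Describe and compare five Gen-IR evaluation methods: binary relevance, graded relevance, subtopic relevance, pairwise preferences, and embeddings.
Use LLMs (GPT-4 and GPT-3.5-Turbo) and embeddings (vanilla BERT) to generate labels or similarity scores.
Validate each method's agreement with human judgments on the DL 2019/2020 datasets.
Assess each method's autonomous operation and auditability against defined criteria (R1–R3).
Experiment with several Gen-IR systems, including "liars," to contrast accurate and inaccurate output.
Release prompts, data, and code for reproducibility.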
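
As a rough illustration of the embedding-based method, the sketch below scores a generated response by its cosine similarity to exemplar answers. It is a minimal sketch, not the authors' implementation: the vectors are assumed to come from some encoder (the review mentions vanilla BERT), and `exemplar_score` with its max-over-exemplars aggregation is a hypothetical choice made here for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def exemplar_score(response_vec: Sequence[float],
                   exemplar_vecs: Sequence[Sequence[float]]) -> float:
    """Hypothetical aggregation: similarity to the closest exemplar."""
    if not exemplar_vecs:
        raise ValueError("at least one exemplar embedding is required")
    return max(cosine_similarity(response_vec, e) for e in exemplar_vecs)


if __name__ == "__main__":
    # Toy 2-d vectors standing in for real sentence embeddings.
    response = [1.0, 0.0]
    exemplars = [[0.0, 1.0], [2.0, 0.0]]
    print(exemplar_score(response, exemplars))
```

A score like this is easy to compute without human input, but an auditor only sees a number, which is consistent with the review's remark that embeddings support auditing only indirectly.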

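Experimental results

Research questions

RQ1: How well do Gen-IR evaluation methods align with human assessments of Gen-IR output?
RQ2: Can these methods operate autonomously using LLM-generated labels while remaining auditable by humans?
RQ3: Which evaluation method best distinguishes high-quality from low-quality Gen-IR responses?
RQ4: How do binary, graded, subtopic, pairwise, and embedding-based evaluation perform relative to one another when applied to Gen-IR output?
RQ5: How do embeddings compare with the other methods in terms of agreement with human judgments and auditability?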
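
Agreement between LLM-generated labels and human labels (RQ1) can be quantified with a chance-corrected statistic. This review does not state which statistic the paper reports; Cohen's kappa is used below only as one common choice, and the label lists are made-up toy data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable],
                 labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    human = [1, 1, 0, 0]  # toy binary relevance labels from assessors
    llm = [1, 1, 1, 0]    # toy labels from an LLM judge
    print(cohens_kappa(human, llm))
```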

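Key findings

Binary and graded relevance show different levels of agreement with human judgments; in some cases graded relevance discriminates better than binary relevance.
Subtopic relevance yields interpretable binary judgments across multiple predicates per query, which supports auditability.
Pairwise preferences produce strong agreement with human judgments but require exemplars, which may reduce autonomy.
Embeddings offer high discriminative power but depend on exemplar-based evaluation and support only indirect auditing.
GPT-4-based assessment generally agrees with human judgments more closely than GPT-3.5-Turbo, regardless of method.
The study provides open access to code, prompts, and data so that the Gen-IR evaluation methods can be reproduced and audited.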
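
To make the subtopic and pairwise findings concrete, the sketch below shows how per-query scores might be aggregated from individual judgments. Both aggregations are assumptions made here for illustration (fraction of subtopics covered; win rate against exemplars with ties counted as half); the review does not specify the paper's exact scoring formulas, and the subtopic names are invented.

```python
from typing import Mapping, Sequence


def subtopic_coverage(judgments: Mapping[str, bool]) -> float:
    """Fraction of a query's subtopics judged as addressed by the response."""
    if not judgments:
        raise ValueError("at least one subtopic judgment is required")
    return sum(judgments.values()) / len(judgments)


def win_rate(outcomes: Sequence[str]) -> float:
    """Share of comparisons won against exemplars; a tie counts as half a win."""
    if not outcomes:
        raise ValueError("at least one comparison outcome is required")
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(points[o] for o in outcomes) / len(outcomes)


if __name__ == "__main__":
    # Each entry is one yes/no judgment that a human could audit directly.
    judgments = {
        "defines the term": True,
        "gives an example": False,
        "mentions limitations": True,
        "cites a source": True,
    }
    print(subtopic_coverage(judgments))
    print(win_rate(["win", "loss", "tie", "win"]))
```

Because every subtopic judgment is a separate yes/no decision, an auditor can inspect and overturn individual entries, which is the interpretability advantage the findings attribute to the subtopic method.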


This review was generated by AI and checked by a human editor.