QUICK REVIEW

[Paper Review] Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation

Gauthier Guinet, Behrooz Omidvar-Tehrani|arXiv (Cornell University)|May 22, 2024

Topic Modeling | Cited by 6

One-Sentence Summary

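This paper proposes an exam-based automated evaluation framework that uses Item Response Theory (IRT) to measure task-specific accuracy and guide RAG design choices.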

ABSTRACT

We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG). Evaluation is performed by scoring the RAG on an automatically-generated synthetic exam composed of multiple choice questions based on the corpus of documents associated with the task. Our method is an automated, cost-efficient, interpretable, and robust strategy to select the optimal components for a RAG system. We leverage Item Response Theory (IRT) to estimate the quality of an exam and its informativeness on task-specific accuracy. IRT also provides a natural way to iteratively improve the exam by eliminating the exam questions that are not sufficiently informative about a model's ability. We demonstrate our approach on four new open-ended Question-Answering tasks based on Arxiv abstracts, StackExchange questions, AWS DevOps troubleshooting guides, and SEC filings. In addition, our experiments reveal more general insights into factors impacting RAG performance like size, retrieval mechanism, prompting and fine-tuning. Most notably, our findings show that choosing the right retrieval algorithms often leads to bigger performance gains than simply using a larger language model.

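Motivation and Objectives

Measure the task-specific accuracy of Retrieval-Augmented LLMs (RAG) over a domain corpus when no ground-truth dataset is available.
Provide a scalable, interpretable evaluation framework that guides component selection (LLM, retrieval, in-context learning).
Show how exam informativeness and component contributions vary with the task and the retrieval strategy.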

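Proposed Method

Automatically generate a multiple-choice exam from the task corpus using a pretrained LLM.
Apply a priori and post hoc filtering to improve question quality and discriminative power.
Evaluate each RAG pipeline by having it answer the exam questions, and compute its accuracy.
Use a hierarchical Item Response Theory (IRT) model to decompose ability into LLM, retrieval, and in-context learning (ICL) components.
Estimate the item parameters (g_i, d_i, b_i) and the per-question information used to weight the exam.
Stratify the exam by Fisher information and Bloom's taxonomy to iteratively maximize its informativeness.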
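
The IRT quantities named in the method can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the standard three-parameter logistic (3PL) reading of (g_i, d_i, b_i) as guessing, discrimination, and difficulty, and it assumes that pipeline ability is the sum of its LLM, retrieval, and ICL components. The function names and numeric values are hypothetical.

```python
import math


def p_correct(theta, g, d, b):
    """3PL probability that a pipeline with ability theta answers an item correctly.

    g: guessing floor, d: discrimination, b: difficulty (assumed interpretation).
    """
    return g + (1.0 - g) / (1.0 + math.exp(-d * (theta - b)))


def item_information(theta, g, d, b):
    """Fisher information of a single 3PL item at ability theta."""
    p = p_correct(theta, g, d, b)
    return d ** 2 * ((p - g) / (1.0 - g)) ** 2 * (1.0 - p) / p


def pipeline_ability(theta_llm, theta_ret, theta_icl):
    """Assumed additive decomposition of ability into component contributions."""
    return theta_llm + theta_ret + theta_icl


# Hypothetical component abilities and item parameters, for illustration only.
theta = pipeline_ability(theta_llm=0.5, theta_ret=0.8, theta_icl=-0.1)
item = {"g": 0.25, "d": 1.2, "b": 1.0}
print("P(correct):", p_correct(theta, **item))
print("information:", item_information(theta, **item))
```

With four answer choices, a guessing floor of g = 0.25 is a natural starting value. An item is most informative near abilities close to its difficulty b, and items with low discrimination d contribute little information at any ability.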

Figure 1: Granular results of our exam evaluation for the task of AWS DevOps troubleshooting. Accuracy is reported for different retrieval approaches and retriever sizes, on a % scale. Labels on the diameter shows the troubleshooting categories, i.e., AWS resources. Colors correspond to different re

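Experimental Results

Research Questions

RQ1: How can we automatically evaluate the task-specific accuracy of a RAG system without ground-truth labels?
RQ2: Which retrieval strategies and LLM sizes achieve the best task-specific performance across domains?
RQ3: How does Item Response Theory (IRT) help interpret and improve RAG evaluation and design decisions?
RQ4: Which factors drive RAG performance (retrieval method, model size, prompting), and how can the exam be optimized to be more informative?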

Key Findings

Best absolute accuracy (%)	Retrieval	t_ops	t_stk	t_arx	t_sec
52.2	ClosedB	48.6	54.5	49.5	51.2
45.5	SIAM	50.0	57.0	47.6	50.0
52.2	DPR	58.3	60.3	60.5	57.8
58.0	BM25	60.4	69.5	55.3	60.8
57.7	MultiQA	72.2	69.5	53.6	63.2
55.1	DPRV2	70.1	69.4	63.9	64.6
63.8	Oracle	74.3	68.6	70.9	69.4
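
To read the table against the closed-book baseline, the short sketch below subtracts the ClosedB row from every other row and lists the columns where a retriever falls below it. The numbers are copied from the table. The column label `first_col` is a placeholder for the first numeric column, and the helper names are hypothetical.

```python
# Accuracies (%) copied from the table above, in the table's column order.
COLUMNS = ["first_col", "t_ops", "t_stk", "t_arx", "t_sec"]
ACCURACY = {
    "ClosedB": [52.2, 48.6, 54.5, 49.5, 51.2],
    "SIAM":    [45.5, 50.0, 57.0, 47.6, 50.0],
    "DPR":     [52.2, 58.3, 60.3, 60.5, 57.8],
    "BM25":    [58.0, 60.4, 69.5, 55.3, 60.8],
    "MultiQA": [57.7, 72.2, 69.5, 53.6, 63.2],
    "DPRV2":   [55.1, 70.1, 69.4, 63.9, 64.6],
    "Oracle":  [63.8, 74.3, 68.6, 70.9, 69.4],
}


def gains_over_baseline(table, baseline="ClosedB"):
    """Percentage-point difference of each row relative to the baseline row."""
    base = table[baseline]
    return {
        name: [round(value - ref, 1) for value, ref in zip(row, base)]
        for name, row in table.items()
        if name != baseline
    }


def below_baseline(gains, columns):
    """Map each retriever to the columns where it scores under the baseline."""
    return {
        name: [col for col, gain in zip(columns, row) if gain < 0]
        for name, row in gains.items()
        if any(gain < 0 for gain in row)
    }


gains = gains_over_baseline(ACCURACY)
for name, row in gains.items():
    print(name, dict(zip(COLUMNS, row)))
print("below ClosedB:", below_baseline(gains, COLUMNS))
```

On these numbers, SIAM is the only retriever that drops below ClosedB, and it does so in three columns (the first numeric column, t_arx, and t_sec).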

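The choice of retrieval method often matters more than simply scaling up the LLM, and some tasks benefit more visibly from particular retrieval variants (e.g., BM25, MultiQA, DPRV2).
Hybrid retrieval models generally offer greater robustness and adaptability across tasks than single-method retrievers.
For tasks dominated by closed-source knowledge, which LLMs could not access during pretraining because of confidentiality restrictions, retrieval quality becomes the limiting factor on performance.
A poorly aligned retriever can perform worse than no retrieval at all (ClosedB), underscoring the importance of matching retrieval to the task.
The IRT-based component decomposition reveals the relative contributions of the LLM, retrieval, and in-context learning on each task.
Analyzing exam informativeness through Fisher information and Bloom's taxonomy helps diagnose and iteratively improve exam quality.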
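
One way to picture the iterative exam improvement is a pruning step that drops the least informative questions. The sketch below is an assumption-laden illustration, not the paper's procedure: it scores each question by its 3PL Fisher information averaged over a grid of abilities and removes a fixed bottom fraction. The drop rule, the ability grid, and all item parameters are hypothetical.

```python
import math


def item_information(theta, g, d, b):
    """Fisher information of a 3PL item (g: guessing, d: discrimination, b: difficulty)."""
    p = g + (1.0 - g) / (1.0 + math.exp(-d * (theta - b)))
    return d ** 2 * ((p - g) / (1.0 - g)) ** 2 * (1.0 - p) / p


def prune_exam(items, abilities, drop_fraction=0.25):
    """Drop the lowest-information fraction of items.

    items: dict mapping question id -> (g, d, b)
    abilities: ability values over which information is averaged
    Returns (kept_items, mean_information_per_item).
    """
    scores = {
        qid: sum(item_information(t, *params) for t in abilities) / len(abilities)
        for qid, params in items.items()
    }
    n_drop = int(len(items) * drop_fraction)
    ranked = sorted(scores, key=scores.get)  # least informative first
    dropped = set(ranked[:n_drop])
    kept = {qid: params for qid, params in items.items() if qid not in dropped}
    return kept, scores


# Hypothetical fitted item parameters; q2 barely discriminates between abilities.
items = {
    "q1": (0.25, 1.5, 0.0),
    "q2": (0.25, 0.1, 0.0),
    "q3": (0.25, 1.0, 0.5),
    "q4": (0.25, 2.0, -0.5),
}
kept, scores = prune_exam(items, abilities=[-1.0, 0.0, 1.0])
print(sorted(kept))
```

In practice the item parameters would be re-estimated on the pruned exam before the next round, since removing questions changes the fit.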

Figure 2: Representation of Bloom’s revised taxonomy. The cognitive complexity of skills increase from the bottom to the top of the pyramid. Source: Vanderbilt University Center for Teaching,

This review was generated by AI and reviewed by a human editor.