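[Paper Digest] Large language models can accurately predict searcher preferences
This paper shows that large language models can generate relevance labels consistent with real users' preferences, reaching near-human accuracy, outperforming some third-party labellers, and enabling scalable training of ranking models.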
Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for their careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Getting relevance labels at scale is usually done with third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels are systematically disagreeing with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternate approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops a large language model prompt that agrees with that data. We present ideas and observations from deploying language models for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found large language models can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. Measuring agreement with real searchers needs high-quality "gold" labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.
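Research Motivation and Objectives
- Assess whether LLMs can reproduce gold-standard relevance labels derived from real user preferences.
- Compare LLM-generated labels with those of gold-standard assessors and third-party labellers in terms of accuracy and reliability.
- Analyse how prompt design and prompt features (description, narrative, aspects, multiple judges) affect labelling quality.
- Evaluate the potential of LLM-based labelling for training improved ranking models.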
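Proposed Method
- Use TREC-Robust 2004 data, with gold labels from trained assessors as ground truth.
- Apply internal GPT-4 prompts with various feature configurations to produce labels on a 0–2 scale.
- Evaluate label quality with MAE and Cohen's kappa against the gold labels, plus AUC for document-level relevance and pairwise preferences.
- Analyse the effect of prompt features (role, description, narrative, aspects, multiple judges) and the sensitivity to prompt length and paraphrasing.
- Use a rank-based measure (RBO) to gauge the effect on query and system orderings, and compare with human labelling.
- Report 95% confidence intervals via bootstrap resampling and identify statistically significant differences.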
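As a rough illustration of the label-quality metrics listed above, the following Python sketch computes MAE and Cohen's kappa against gold labels and attaches a percentile-bootstrap 95% confidence interval. The toy label lists, the function names, and the choice of a percentile bootstrap are assumptions made for illustration; this digest does not give the paper's own evaluation code or its exact bootstrap procedure.

```python
import math
import random
from collections import Counter


def mean_absolute_error(gold, predicted):
    """Average absolute difference between gold and predicted labels."""
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)


def cohen_kappa(gold, predicted):
    """Unweighted Cohen's kappa: chance-corrected agreement of two label lists."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    # Agreement expected by chance, from each rater's marginal label frequencies.
    expected = sum(gold_counts[c] * pred_counts[c] for c in gold_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when both raters use one identical label
    return (observed - expected) / (1.0 - expected)


def bootstrap_ci(gold, predicted, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric, resampling (gold, predicted) pairs."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([gold[i] for i in idx], [predicted[i] for i in idx])
        if not math.isnan(value):  # skip degenerate resamples
            stats.append(value)
    if not stats:
        raise ValueError("metric was undefined on every resample")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Hypothetical 0-2 relevance labels for twelve query-document pairs (not paper data).
gold_labels = [0, 0, 1, 2, 2, 1, 0, 2, 1, 0, 0, 2]
llm_labels = [0, 1, 1, 2, 2, 1, 0, 1, 1, 0, 0, 2]

print("MAE  :", round(mean_absolute_error(gold_labels, llm_labels), 3))
print("kappa:", round(cohen_kappa(gold_labels, llm_labels), 3))
print("95% CI for kappa:", bootstrap_ci(gold_labels, llm_labels, cohen_kappa))
```

Two prompt variants could then be compared by checking whether their bootstrap intervals for kappa overlap, which is one simple (and conservative) reading of "statistically significant difference".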
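Experimental Results
Research Questions
- RQ1: Can LLMs reproduce gold-standard relevance labels for TREC-Robust data?
- RQ2: How do prompt features and configurations affect the accuracy of LLM labelling and its agreement with gold labels?
- RQ3: Do LLM-generated labels align with first-party real-user preferences (beyond expert labels)?
- RQ4: How does LLM-based labelling affect downstream ranking performance compared with human labelling?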
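Key Findings
- LLMs can reach notable agreement with gold labels; Cohen's kappa ranges from 0.20 to 0.64 depending on the prompt.
- For binary outcomes (relevant / not relevant), labels are highly reliable when the model labels a document relevant or highly relevant.
- Including aspects (topicality and trust) significantly improves agreement (kappa rises by about 0.21).
- Prompt design, and even minor paraphrasing, noticeably affects accuracy, indicating sensitivity to prompt wording.
- LLMs can outperform crowd workers in agreement with gold labels, while also offering cost and scalability advantages.
- LLM-labelled data can be used to train more effective ranking models.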
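The comparison of query and system orderings with RBO, mentioned in the method, can be sketched as follows. This is a minimal version of extrapolated rank-biased overlap for two equal-length, tie-free rankings; the persistence parameter p = 0.9, the system names, and the function name are illustrative assumptions, not values taken from the paper.

```python
def rbo_ext(ranking_a, ranking_b, p=0.9):
    """Extrapolated rank-biased overlap of two equal-length rankings without ties.

    Returns a value in [0, 1]; 1 means identical orderings, 0 means disjoint lists.
    Smaller p puts more weight on agreement near the top of the rankings.
    """
    if len(ranking_a) != len(ranking_b) or not ranking_a:
        raise ValueError("this sketch needs two non-empty rankings of equal length")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")

    k = len(ranking_a)
    seen_a, seen_b = set(), set()
    overlap = 0          # size of the intersection of the two prefixes
    weighted_sum = 0.0   # sum over depths d of (overlap_d / d) * p**d
    for depth in range(1, k + 1):
        item_a, item_b = ranking_a[depth - 1], ranking_b[depth - 1]
        if item_a == item_b:
            overlap += 1
        else:
            if item_a in seen_b:
                overlap += 1
            if item_b in seen_a:
                overlap += 1
        seen_a.add(item_a)
        seen_b.add(item_b)
        weighted_sum += (overlap / depth) * p ** depth

    # Assume the agreement seen at depth k continues beyond the evaluated prefix.
    return (overlap / k) * p ** k + (1.0 - p) / p * weighted_sum


# Hypothetical system names, ordered best-first under each label source.
order_by_gold_labels = ["sysA", "sysB", "sysC", "sysD", "sysE"]
order_by_llm_labels = ["sysA", "sysC", "sysB", "sysD", "sysE"]

print("RBO:", round(rbo_ext(order_by_gold_labels, order_by_llm_labels), 3))
```

A value close to 1 would indicate that evaluating with LLM labels ranks the systems (or the hardest queries) in nearly the same order as evaluating with human labels.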
This digest was generated by AI and reviewed by human editors.