QUICK REVIEW

[Paper Review] InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval

Vitor Jeronymo, Luiz Bonifacio|arXiv (Cornell University)|Jan 4, 2023

Topic Modeling | Cited by 26

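One-Sentence Summary

InPars-v2 uses open-source LLMs to generate synthetic query-document pairs, filters them with a monoT5 reranker, and trains a dataset-specific reranker for each of 18 datasets, achieving a new state of the art on BEIR while releasing the code, data, and models.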

ABSTRACT

Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tpu

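Motivation and Goals

Advance data augmentation for IR when labeled in-domain data is scarce.
Replace proprietary LLMs with open-source ones for synthetic query generation.
Introduce a better filtering step to select high-quality synthetic query-document pairs.
Demonstrate state-of-the-art BEIR results and release open-source artifacts for reproducible experiments.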

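Proposed Method

Use the open-source GPT-J-6B with a 3-shot MS MARCO prompt to generate 100K synthetic queries for each BEIR dataset.
Score the pairs with a monoT5-3B finetuned on MS MARCO and keep the 10K highest-scoring pairs.
Create negative examples by sampling non-relevant documents from the top BM25 results for each synthetic query.
Finetune monoT5-3B as a reranker on MS MARCO, then finetune it further on the synthetic data of each dataset.
Train a separate reranker for each BEIR dataset and evaluate it in a BM25 retrieval + reranking pipeline.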
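
The filtering and negative-sampling steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `SyntheticPair`, `select_top_k`, `sample_negative`, and `build_training_triples` are hypothetical names, the reranker scores and BM25 candidate lists are assumed to be precomputed, and any BM25 candidate other than the source document is treated as non-relevant.

```python
import random
from dataclasses import dataclass


@dataclass
class SyntheticPair:
    query: str    # synthetic query generated by the LLM
    doc_id: str   # document the query was generated from (the positive)
    score: float  # precomputed reranker score (hypothetical field)


def select_top_k(pairs, k):
    """Keep the k pairs with the highest reranker score."""
    return sorted(pairs, key=lambda p: p.score, reverse=True)[:k]


def sample_negative(positive_doc_id, bm25_candidates, rng):
    """Pick one BM25 candidate that is not the positive document.

    Assumption: every candidate other than the source document is
    treated as non-relevant. Returns None if no such candidate exists.
    """
    pool = [d for d in bm25_candidates if d != positive_doc_id]
    if not pool:
        return None
    return rng.choice(pool)


def build_training_triples(pairs, bm25_results, k, seed=0):
    """Return (query, positive_doc_id, negative_doc_id) triples.

    bm25_results maps each query string to its list of BM25 doc ids.
    Queries without any usable negative are skipped.
    """
    rng = random.Random(seed)
    triples = []
    for pair in select_top_k(pairs, k):
        candidates = bm25_results.get(pair.query, [])
        negative = sample_negative(pair.doc_id, candidates, rng)
        if negative is not None:
            triples.append((pair.query, pair.doc_id, negative))
    return triples


if __name__ == "__main__":
    toy_pairs = [
        SyntheticPair("q1", "d1", 0.9),
        SyntheticPair("q2", "d2", 0.1),
        SyntheticPair("q3", "d3", 0.5),
    ]
    toy_bm25 = {"q1": ["d1", "d7", "d8"], "q3": ["d3"]}
    print(build_training_triples(toy_pairs, toy_bm25, k=2))
```

In the setting described above, `k` would be 10K out of 100K generated queries per dataset; the toy values here only keep the example small.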

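Experimental Results

Research Questions

RQ1: Can open-source LLMs generate synthetic data that is competitive with proprietary approaches for IR training?
RQ2: Does a filtering step based on a learned reranker improve the quality of synthetic query-document pairs, and thereby IR training?
RQ3: How large are the gains on BEIR when dataset-specific rerankers are trained on synthetic data?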

Key Findings

Dataset	BM25	monoT5-3B	+InPars-v1	+InPars-v2	Avg	Avg PrGator
MARCO	0.594	0.801	0.846	0.846	0.762	0.823
TREC-Covid	0.594	0.801	0.846	0.846	0.762	0.823
Robust	0.407	0.615	0.610	0.632	-	-
FiQA	0.236	0.509	0.492	0.509	0.494	0.493
DBPedia	0.318	0.472	0.494	0.498	0.434	0.459
SciDocs	0.149	0.197	0.206	0.208	0.201	0.191
SciFact	0.678	0.774	0.774	0.774	0.731	0.760
NFCorpus	0.321	0.383	0.385	0.385	0.370	0.399
BioASQ	0.522	0.566	0.607	0.595	-	0.579
Natural Questions	0.305	0.625	0.625	0.638	-	0.647
HotpotQA	0.633	0.760	0.790	0.791	0.736	0.753
TREC-News	0.395	0.477	0.458	0.490	-	-
Quora	0.788	0.835	0.874	0.845	-	0.819
FEVER	0.651	0.848	0.852	0.872	0.866	0.848
Climate-FEVER	0.165	0.288	0.287	0.323	0.241	0.275
Signal	0.328	0.302	0.319	0.308	-	0.319
ArguAna	0.397	0.379	0.371	0.369	0.630	0.406
Touche	0.442	0.309	0.260	0.291	0.381	0.486
CQADupstack	0.302	0.449	0.449	0.448	-	-
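
Several rows in the table above contain "-" where a score is missing, so any average that compares two columns is only meaningful over the datasets both columns cover. The sketch below shows one way to compute such a restricted average. It is an illustration under stated assumptions: `parse_score` and `mean_over_shared` are hypothetical helpers, the key `last_column` is a placeholder name for the table's final column, and the three rows are copied from the table only as sample input.

```python
def parse_score(value):
    """Convert a table cell to float, treating '-' as missing."""
    return None if value == "-" else float(value)


def mean_over_shared(rows, col_a, col_b):
    """Average two columns over the rows where both have a value."""
    shared = []
    for row in rows:
        a = parse_score(row[col_a])
        b = parse_score(row[col_b])
        if a is not None and b is not None:
            shared.append((a, b))
    if not shared:
        raise ValueError("no rows with values in both columns")
    n = len(shared)
    mean_a = sum(a for a, _ in shared) / n
    mean_b = sum(b for _, b in shared) / n
    return mean_a, mean_b


if __name__ == "__main__":
    # Three rows from the table; 'last_column' is a placeholder key.
    sample_rows = [
        {"dataset": "Robust", "+InPars-v2": "0.632", "last_column": "-"},
        {"dataset": "FiQA", "+InPars-v2": "0.509", "last_column": "0.493"},
        {"dataset": "SciDocs", "+InPars-v2": "0.208", "last_column": "0.191"},
    ]
    print(mean_over_shared(sample_rows, "+InPars-v2", "last_column"))
```

Here the Robust row is skipped because its final cell is missing, so both means are taken over FiQA and SciDocs only.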

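InPars-v2 outperforms InPars-v1 and reaches the state of the art on the BEIR average.
On the BEIR benchmark, the method is competitive with Promptagator and RankT5 on most datasets.
Synthetic data generated from MS MARCO prompts and filtered by the monoT5-3B reranker yields strong BEIR performance.
The open-source synthetic data, code, and finetuned models support reproducibility and follow-up research.
The average BEIR performance (Avg) shows improvements over the Avg PrGator baseline across several datasets.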


This review was generated by AI and reviewed by a human editor.