QUICK REVIEW

[Paper Review] BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Nandan Thakur, Nils Reimers|arXiv (Cornell University)|Apr 17, 2021

Topic Modeling|References: 65|Cited by: 25

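One-Sentence Summary

BEIR introduces a heterogeneous zero-shot benchmark for information retrieval that covers 18 datasets and 9 tasks and is used to evaluate 10 retrieval systems, highlighting the strength of the BM25 baseline and the differences in generalization across architectures.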

ABSTRACT

Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers to broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems including lexical, sparse, dense, late-interaction and re-ranking architectures on the BEIR benchmark. Our results show BM25 is a robust baseline and re-ranking and late-interaction-based models on average achieve the best zero-shot performances, however, at high computational costs. In contrast, dense and sparse-retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards better robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.

Motivation and Objectives

Motivate robust evaluation of IR models beyond in-domain, dataset-specific settings.
Provide a diverse, zero-shot benchmark spanning multiple tasks and domains.
Assess generalization capabilities of lexical, sparse, dense, late-interaction, and re-ranking IR models.

Proposed Method

Assemble 18 English zero-shot datasets from 9 retrieval tasks covering diverse domains and document/query characteristics.
Evaluate 10 retrieval systems across five architectures (lexical, sparse, dense, late-interaction, re-ranking).
Use a unified data format (corpus, queries, qrels) and standard evaluation metrics (nDCG@10).
Analyze cross-domain generalization and efficiency (latency and index sizes).
Investigate annotation bias effects and provide guidance for fair comparisons.
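As a rough illustration of the unified format and the metric named above, the sketch below builds a toy corpus, query set, and qrels mapping and scores a ranking with nDCG@10. The document IDs, texts, and function names are made up for the example, and the linear-gain form of DCG used here is an assumption; the benchmark's own tooling may compute the metric differently.

```python
import math

# Toy data in the three-part layout (all IDs and texts are hypothetical).
corpus = {
    "d1": {"title": "BM25", "text": "A lexical ranking function based on term statistics."},
    "d2": {"title": "Dense retrieval", "text": "Queries and documents are embedded as vectors."},
    "d3": {"title": "Cooking", "text": "How to bake sourdough bread."},
}
queries = {"q1": "how does lexical ranking work"}
qrels = {"q1": {"d1": 2, "d2": 1}}  # query ID -> {doc ID: graded relevance}


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount (linear gain, an assumption)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, judged, k=10):
    """nDCG@k for one query; unjudged documents count as gain 0."""
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg_at_k(run, qrels, k=10):
    """Average nDCG@k over all judged queries; run maps query ID -> ranked doc IDs."""
    scores = [ndcg_at_k(run.get(qid, []), judged, k) for qid, judged in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    run = {"q1": ["d3", "d2", "d1"]}  # a deliberately poor ranking
    print(round(mean_ndcg_at_k(run, qrels), 3))
```

Averaging per-query scores this way gives one number per dataset, which is the granularity at which the systems are compared.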

Experimental Results

Research Questions

RQ1: How do diverse IR models generalize to out-of-distribution domains and tasks in a zero-shot setting?
RQ2: Is there a trade-off between retrieval performance and computational efficiency across architectures?
RQ3: What role do annotation biases play in evaluating retrieval systems on BEIR datasets?
RQ4: Which architectures offer robust zero-shot performance and under what domain/task conditions do they excel or fail?

Main Findings

Dataset (↓) / Model (→)	BM25	DeepCT	SPARTA	docT5query	DPR	ANCE	TAS-B	GenQ	ColBERT	BM25+CE
MS MARCO	0.228	0.296	0.351	0.338	0.177	0.388	0.408	0.408	0.401	0.413
TREC-COVID	0.656	0.406	0.538	0.713	0.332	0.654	0.481	0.619	0.677	0.757
BioASQ	0.465	0.407	0.351	0.431	0.127	0.306	0.383	0.398	0.474	0.523
NFCorpus	0.325	0.283	0.301	0.328	0.189	0.237	0.319	0.319	0.305	0.350
NQ	0.329	0.188	0.398	0.399	0.474	0.446	0.463	0.358	0.524	0.533
HotpotQA	0.603	0.503	0.492	0.580	0.391	0.456	0.584	0.534	0.593	0.707
FiQA-2018	0.236	0.191	0.198	0.291	0.112	0.295	0.300	0.308	0.317	0.347
Signal-1M (RT)	0.330	0.269	0.252	0.307	0.155	0.249	0.289	0.281	0.274	0.338
TREC-NEWS	0.398	0.220	0.258	0.420	0.161	0.382	0.377	0.396	0.393	0.431
Robust04	0.408	0.287	0.276	0.437	0.252	0.392	0.427	0.362	0.391	0.475
ArguAna	0.315	0.309	0.279	0.349	0.175	0.415	0.429	0.493	0.233	0.311
Touché-2020	0.367	0.156	0.175	0.347	0.131	0.240	0.162	0.182	0.202	0.271
CQADupStack	0.299	0.268	0.257	0.325	0.153	0.296	0.314	0.347	0.350	0.370
Quora	0.789	0.691	0.630	0.802	0.248	0.852	0.835	0.830	0.854	0.825
DBPedia	0.313	0.177	0.314	0.331	0.263	0.281	0.384	0.328	0.392	0.409
SCIDOCS	0.158	0.124	0.126	0.162	0.077	0.122	0.149	0.143	0.145	0.166
FEVER	0.753	0.353	0.596	0.714	0.562	0.669	0.700	0.669	0.771	0.819
Climate-FEVER	0.213	0.066	0.082	0.201	0.148	0.198	0.228	0.175	0.184	0.253
SciFact	0.665	0.630	0.582	0.675	0.318	0.507	0.643	0.644	0.671	0.688
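One way to read a results table like this is to count, per model, the datasets on which it scores above the BM25 baseline. The sketch below does that for three rows copied from the table (TREC-COVID, ArguAna, Touché-2020); the function name and the choice of rows are illustrative assumptions, not part of the paper's analysis.

```python
# Column order follows the table header.
MODELS = ["BM25", "DeepCT", "SPARTA", "docT5query", "DPR",
          "ANCE", "TAS-B", "GenQ", "ColBERT", "BM25+CE"]

# Three rows copied from the table (a subset, for brevity).
ROWS = {
    "TREC-COVID": [0.656, 0.406, 0.538, 0.713, 0.332, 0.654, 0.481, 0.619, 0.677, 0.757],
    "ArguAna": [0.315, 0.309, 0.279, 0.349, 0.175, 0.415, 0.429, 0.493, 0.233, 0.311],
    "Touché-2020": [0.367, 0.156, 0.175, 0.347, 0.131, 0.240, 0.162, 0.182, 0.202, 0.271],
}


def wins_over_baseline(rows, models, baseline="BM25"):
    """Count, for each non-baseline model, the datasets where it beats the baseline."""
    base_idx = models.index(baseline)
    wins = {m: 0 for m in models if m != baseline}
    for scores in rows.values():
        for model, score in zip(models, scores):
            if model != baseline and score > scores[base_idx]:
                wins[model] += 1
    return wins


if __name__ == "__main__":
    for model, count in wins_over_baseline(ROWS, MODELS).items():
        print(f"{model}: beats BM25 on {count} of {len(ROWS)} datasets")
```

Extending `ROWS` with the remaining table rows would give the full picture; a win count ignores the size of each gap, so it complements rather than replaces per-dataset scores.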

BM25 remains a strong zero-shot baseline across many datasets.
Re-ranking and late-interaction models often yield the best zero-shot performance, but at high latency and memory cost.
Dense and sparse retrievers frequently underperform BM25 in zero-shot generalization, despite good in-domain results.
Cross-attentional re-ranking (BM25+CE) and ColBERT show strong out-of-distribution generalization, outperforming BM25 on many datasets.
GenQ can aid domain adaptation for dense retrievers, improving performance on specialized domains but not universally.
Annotation-bias analysis (hole@10, the share of top-10 results that have no relevance judgment) shows that annotations can carry a lexical bias that favors lexical methods, so non-lexical approaches may be underestimated unless annotation is done carefully.
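The hole@10 statistic mentioned in the last finding can be sketched as follows, assuming it is the fraction of each query's top-10 retrieved documents that annotators never judged, averaged over queries. The function name, the averaging choice, and the toy data are assumptions made for illustration.

```python
def hole_at_k(run, qrels, k=10):
    """Mean fraction of top-k results per query that have no relevance judgment.

    run:   query ID -> ranked list of doc IDs
    qrels: query ID -> {doc ID: relevance}; judged non-relevant docs must be
           present with relevance 0, otherwise they are counted as holes.
    """
    fractions = []
    for qid, ranked in run.items():
        top = ranked[:k]
        if not top:
            continue  # skip queries with no retrieved documents
        judged = qrels.get(qid, {})
        unjudged = sum(1 for doc_id in top if doc_id not in judged)
        fractions.append(unjudged / len(top))
    return sum(fractions) / len(fractions) if fractions else 0.0


if __name__ == "__main__":
    # Hypothetical run: two of q1's four results were never judged.
    run = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
    qrels = {"q1": {"a": 1, "b": 0}, "q2": {"x": 1, "y": 1}}
    print(hole_at_k(run, qrels))
```

A system with a high value is being scored largely on documents nobody assessed, which is why a gap in this statistic between lexical and non-lexical systems points to a biased comparison.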

This review was generated by AI and reviewed by human editors.