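[Paper Walkthrough] BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
BEIR introduces a heterogeneous zero-shot information retrieval benchmark that covers 18 datasets and 9 tasks and is used to evaluate 10 retrieval systems, highlighting the strength of the BM25 baseline and the differences in generalization ability across architectures.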
Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers to broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems including lexical, sparse, dense, late-interaction and re-ranking architectures on the BEIR benchmark. Our results show BM25 is a robust baseline and re-ranking and late-interaction-based models on average achieve the best zero-shot performances, however, at high computational costs. In contrast, dense and sparse-retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards better robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.
Research Motivation and Goals
- Motivate robust evaluation of IR models beyond in-domain, dataset-specific settings.
- Provide a diverse, zero-shot benchmark spanning multiple tasks and domains.
- Assess generalization capabilities of lexical, sparse, dense, late-interaction, and re-ranking IR models.
Proposed Method
- Assemble 18 English zero-shot datasets from 9 retrieval tasks covering diverse domains and document/query characteristics.
- Evaluate 10 retrieval systems across five architectures (lexical, sparse, dense, late-interaction, re-ranking).
- Use a unified data format (corpus, queries, qrels) and a standard evaluation metric (nDCG@10).
- Analyze cross-domain generalization and efficiency (latency and index sizes).
- Investigate annotation bias effects and provide guidance for fair comparisons.
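The unified format and the metric listed above can be illustrated with a small sketch. It assumes the three parts are held as plain Python dictionaries; the variable names, the toy documents, and the `ndcg_at_k` helper are hypothetical and are not taken from the BEIR codebase. It uses linear gain with a log2 rank discount, which is one common definition of nDCG, so scores from other tooling may differ in details.

```python
import math

# Toy data in a corpus / queries / qrels layout (all contents are illustrative).
corpus = {
    "d1": {"title": "Doc one", "text": "Masks reduce transmission."},
    "d2": {"title": "Doc two", "text": "Unrelated finance article."},
    "d3": {"title": "Doc three", "text": "Vaccines and transmission."},
}
queries = {"q1": "what reduces virus transmission?"}
# qrels: query id -> {document id -> graded relevance judgment}
qrels = {"q1": {"d1": 1, "d3": 1}}
# run: query id -> {document id -> retrieval score from some system}
run = {"q1": {"d2": 0.9, "d1": 0.8, "d3": 0.1}}


def dcg(gains):
    """Discounted cumulative gain with linear gain and log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(qrels, run, k=10):
    """Mean nDCG@k over all judged queries; unjudged documents count as gain 0."""
    scores = []
    for qid, judged in qrels.items():
        ranked = sorted(run.get(qid, {}).items(), key=lambda kv: kv[1], reverse=True)[:k]
        gains = [judged.get(doc_id, 0) for doc_id, _ in ranked]
        ideal = sorted(judged.values(), reverse=True)[:k]
        idcg = dcg(ideal)
        scores.append(dcg(gains) / idcg if idcg > 0 else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


score = ndcg_at_k(qrels, run, k=10)
print(f"nDCG@10 = {score:.4f}")
```

In this toy run the irrelevant document `d2` is ranked first, so the score falls below 1.0; a run that placed `d1` and `d3` on top would score exactly 1.0.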
Experimental Results
Research Questions
- RQ1: How do diverse IR models generalize to out-of-distribution domains and tasks in a zero-shot setting?
- RQ2: Is there a trade-off between retrieval performance and computational efficiency across architectures?
- RQ3: What role do annotation biases play in evaluating retrieval systems on BEIR datasets?
- RQ4: Which architectures offer robust zero-shot performance, and under what domain/task conditions do they excel or fail?
Key Findings
| Dataset (↓) / Model (→) | BM25 | DeepCT | SPARTA | docT5query | DPR | ANCE | TAS-B | GenQ | ColBERT | BM25+CE |
|---|---|---|---|---|---|---|---|---|---|---|
| MS MARCO | 0.228 | 0.296 | 0.351 | 0.338 | 0.177 | 0.388 | 0.408 | 0.408 | 0.401 | 0.413 |
| TREC-COVID | 0.656 | 0.406 | 0.538 | 0.713 | 0.332 | 0.654 | 0.481 | 0.619 | 0.677 | 0.757 |
| BioASQ | 0.465 | 0.407 | 0.351 | 0.431 | 0.127 | 0.306 | 0.383 | 0.398 | 0.474 | 0.523 |
| NFCorpus | 0.325 | 0.283 | 0.301 | 0.328 | 0.189 | 0.237 | 0.319 | 0.319 | 0.305 | 0.350 |
| NQ | 0.329 | 0.188 | 0.398 | 0.399 | 0.474 | 0.446 | 0.463 | 0.358 | 0.524 | 0.533 |
| HotpotQA | 0.603 | 0.503 | 0.492 | 0.580 | 0.391 | 0.456 | 0.584 | 0.534 | 0.593 | 0.707 |
| FiQA-2018 | 0.236 | 0.191 | 0.198 | 0.291 | 0.112 | 0.295 | 0.300 | 0.308 | 0.317 | 0.347 |
| Signal-1M (RT) | 0.330 | 0.269 | 0.252 | 0.307 | 0.155 | 0.249 | 0.289 | 0.281 | 0.274 | 0.338 |
| TREC-NEWS | 0.398 | 0.220 | 0.258 | 0.420 | 0.161 | 0.382 | 0.377 | 0.396 | 0.393 | 0.431 |
| Robust04 | 0.408 | 0.287 | 0.276 | 0.437 | 0.252 | 0.392 | 0.427 | 0.362 | 0.391 | 0.475 |
| ArguAna | 0.315 | 0.309 | 0.279 | 0.349 | 0.175 | 0.415 | 0.429 | 0.493 | 0.233 | 0.311 |
| Touché-2020 | 0.367 | 0.156 | 0.175 | 0.347 | 0.131 | 0.240 | 0.162 | 0.182 | 0.202 | 0.271 |
| CQADupStack | 0.299 | 0.268 | 0.257 | 0.325 | 0.153 | 0.296 | 0.314 | 0.347 | 0.350 | 0.370 |
| Quora | 0.789 | 0.691 | 0.630 | 0.802 | 0.248 | 0.852 | 0.835 | 0.830 | 0.854 | 0.825 |
| DBPedia | 0.313 | 0.177 | 0.314 | 0.331 | 0.263 | 0.281 | 0.384 | 0.328 | 0.392 | 0.409 |
| SCIDOCS | 0.158 | 0.124 | 0.126 | 0.162 | 0.077 | 0.122 | 0.149 | 0.143 | 0.145 | 0.166 |
| FEVER | 0.753 | 0.353 | 0.596 | 0.714 | 0.562 | 0.669 | 0.700 | 0.669 | 0.771 | 0.819 |
| Climate-FEVER | 0.213 | 0.066 | 0.082 | 0.201 | 0.148 | 0.198 | 0.228 | 0.175 | 0.184 | 0.253 |
| SciFact | 0.665 | 0.630 | 0.582 | 0.675 | 0.318 | 0.507 | 0.643 | 0.644 | 0.671 | 0.688 |
- BM25 remains a strong zero-shot baseline across many datasets.
- Re-ranking and late-interaction models often yield the best zero-shot performance, but at high latency and memory costs.
- Dense and sparse retrievers frequently underperform BM25 in zero-shot generalization, despite good in-domain results.
- Cross-attentional re-ranking (BM25+CE) and ColBERT show strong out-of-distribution generalization, outperforming BM25 on many datasets.
- GenQ can aid domain adaptation for dense retrievers, improving performance on specialized domains but not universally.
- Annotation-bias analyses (hole@10, the share of top-10 results that have no relevance judgment) reveal a lexical bias in the annotations that favors lexical methods, implying that non-lexical approaches are underestimated unless annotation is done carefully.
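The hole@10 statistic mentioned in the last finding can be sketched as follows, taking it to mean the fraction of a system's top-10 results that carry no relevance judgment at all. This is an illustrative reading, not the benchmark's own implementation: the `hole_at_k` helper and the toy data are hypothetical, "judged" is decided per query, and the count is divided by `k` even when fewer than `k` documents were retrieved.

```python
def hole_at_k(qrels, run, k=10):
    """Mean fraction of top-k retrieved documents that have no judgment.

    qrels: query id -> {document id -> relevance judgment}
    run:   query id -> {document id -> retrieval score}
    A document judged as non-relevant (score 0) still counts as judged.
    """
    rates = []
    for qid, doc_scores in run.items():
        top = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
        judged = qrels.get(qid, {})
        unjudged = sum(1 for doc_id in top if doc_id not in judged)
        rates.append(unjudged / k)
    return sum(rates) / len(rates) if rates else 0.0


# Toy example: two of the four retrieved documents were never judged.
toy_qrels = {"q1": {"d1": 1, "d2": 0}}
toy_run = {"q1": {"d1": 3.0, "d5": 2.0, "d2": 1.0, "d9": 0.5}}
hole = hole_at_k(toy_qrels, toy_run, k=4)
print(f"hole@4 = {hole:.2f}")
```

A high value signals that many retrieved documents were never seen by annotators, so a system's nDCG@10 may understate its true quality because unjudged documents are scored as non-relevant.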
This walkthrough was generated by AI and reviewed by a human editor.