QUICK REVIEW

[Paper Review] Overview of the TREC 2021 deep learning track

Nick Craswell, Bhaskar Mitra|ArXiv.org|Jul 10, 2025

Topic Modeling | Cited by 57

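One-Sentence Summary

The paper reports on the third year of the TREC Deep Learning Track, which uses the refreshed MS MARCO v2 data for document and passage retrieval. It shows that neural ranking with large-scale pretraining generally outperforms traditional methods, that single-stage retrieval is competitive but has not yet matched multi-stage pipelines, and it discusses data collection and coverage issues.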

ABSTRACT

This is the fifth year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. We mostly repeated last year's design, to get another matching test set, based on the larger, cleaner, less-biased v2 passage and document set, with passage ranking as primary and document ranking as a secondary task (using labels inferred from passage). As we did last year, we sample from MS MARCO queries that were completely held out, unused in corpus construction, unlike the test queries in the first three years. This approach yields a more difficult test with more headroom for improvement. Alongside the usual MS MARCO (human) queries from MS MARCO, this year we generated synthetic queries using a fine-tuned T5 model and using a GPT-4 prompt. The new headline result this year is that runs using Large Language Model (LLM) prompting in some way outperformed runs that use the "nnlm" approach, which was the best approach in the previous four years. Since this is the last year of the track, future iterations of prompt-based ranking can happen in other tracks. Human relevance assessments were applied to all query types, not just human MS MARCO queries. Evaluation using synthetic queries gave similar results to human queries, with system ordering agreement of $τ=0.8487$. However, human effort was needed to select a subset of the synthetic queries that were usable. We did not see clear evidence of bias, where runs using GPT-4 were favored when evaluated using synthetic GPT-4 queries, or where runs using T5 were favored when evaluated on synthetic T5 queries.
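The system ordering agreement quoted in the abstract is a Kendall's τ value. As a minimal sketch of how such a number can be computed, the Python below compares the orderings that two query sets induce on the same runs. The run names and scores are hypothetical placeholders, not track results, and the tau-a variant used here (ties count as neither concordant nor discordant) is an assumption, since the abstract does not say which variant was used.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {run_name: score} mappings.

    Only runs present in both mappings are compared. Tied pairs count as
    neither concordant nor discordant.
    """
    runs = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for r1, r2 in combinations(runs, 2):
        # The sign of the product says whether both evaluations agree on
        # which of the two runs is better.
        product = (scores_a[r1] - scores_a[r2]) * (scores_b[r1] - scores_b[r2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(runs) * (len(runs) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical per-run scores under two query sets (illustrative only).
human_queries = {"run_a": 0.71, "run_b": 0.65, "run_c": 0.60, "run_d": 0.42}
synthetic_queries = {"run_a": 0.69, "run_b": 0.58, "run_c": 0.61, "run_d": 0.40}

tau = kendall_tau(human_queries, synthetic_queries)
print(f"tau = {tau:.4f}")
```

In this toy data only one of the six run pairs (run_b and run_c) swaps order, so the two evaluations mostly agree. A value near 1 means the two query sets rank systems almost identically.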

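Research Motivation and Objectives

Benchmark ad hoc retrieval methods at scale, using the refreshed MS MARCO v2 data for document and passage retrieval.
Compare neural ranking models with traditional baselines in both full-ranking and reranking settings.
Encourage analysis of dense retrieval and of single-stage versus multi-stage ranking pipelines.
Study how the data refresh affects judgments and the consistency and compatibility of training labels.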

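Proposed Method

Use the MS MARCO v2 datasets for document and passage ranking tasks, covering both the full-ranking subtask and the top-100 reranking subtask.
Evaluate neural ranking models with large-scale pretraining (nnlm) against traditional methods (trad) and baseline runs.
Annotate submitted runs by whether they use dense retrieval and whether ranking is single-stage or multi-stage.
Report metrics such as RR, NDCG@10, NCG@100, and AP for both tasks, using NIST judgments and MS MARCO labels.
Analyze end-to-end retrieval versus reranking performance, and the gap between single-stage and multi-stage runs.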
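The metrics named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the track's evaluation code: it uses linear gains with a log2 discount for NDCG, treats NCG@k as cumulative gain divided by the ideal cumulative gain at k, and uses a `min_rel` threshold (a hypothetical parameter name) to binarize graded labels for RR and AP. The judgments and ranking are made up.

```python
import math


def reciprocal_rank(ranked, qrels, min_rel=1):
    """1 / rank of the first result judged at or above min_rel, else 0."""
    for rank, doc_id in enumerate(ranked, start=1):
        if qrels.get(doc_id, 0) >= min_rel:
            return 1.0 / rank
    return 0.0


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked, qrels, k=10):
    gains = [qrels.get(doc_id, 0) for doc_id in ranked[:k]]
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def ncg_at_k(ranked, qrels, k=100):
    """Undiscounted cumulative gain at k, normalized by the ideal gain at k."""
    gain = sum(qrels.get(doc_id, 0) for doc_id in ranked[:k])
    ideal = sum(sorted(qrels.values(), reverse=True)[:k])
    return gain / ideal if ideal > 0 else 0.0


def average_precision(ranked, qrels, min_rel=1):
    n_relevant = sum(1 for grade in qrels.values() if grade >= min_rel)
    if n_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if qrels.get(doc_id, 0) >= min_rel:
            hits += 1
            total += hits / rank  # precision at this relevant result
    return total / n_relevant


# Hypothetical graded judgments and one ranked list for a single query.
qrels = {"d1": 3, "d2": 0, "d3": 2, "d4": 1}
ranked = ["d2", "d1", "d4", "d5", "d3"]

print("RR      ", reciprocal_rank(ranked, qrels))
print("NDCG@10 ", round(ndcg_at_k(ranked, qrels, k=10), 4))
print("NCG@100 ", ncg_at_k(ranked, qrels, k=100))
print("AP      ", round(average_precision(ranked, qrels), 4))
```

NCG@100 ignores the order within the top 100, which is why it is useful for judging a first-stage candidate list, while NDCG@10 rewards placing highly graded results near the top.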

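Experimental Results

Research Questions

RQ1: On the refreshed MS MARCO v2 data, how do neural ranking models with large-scale pretraining compare with traditional retrieval methods on the document and passage tasks?
RQ2: In end-to-end ranking, how large is the performance gap between single-stage retrieval and multi-stage retrieval pipelines?
RQ3: How does the dataset refresh (scale, mapping, encoding fixes) affect training labels, judgments, and the overall evaluation?
RQ4: Does dense retrieval provide consistent gains on the document and passage tasks, particularly in the full-ranking versus reranking settings?

Key Findings

Neural ranking with large-scale pretraining (nnlm) significantly outperforms traditional methods on both the document and passage tasks.
On NDCG@10, the best nnlm document run improves on the best traditional run by about 15%, and the best nnlm passage run shows a larger gap in some comparisons (roughly 36%).
Single-stage (dense) retrieval can achieve competitive results, but it still trails multi-stage pipelines in end-to-end retrieval on both tasks.
The best fullrank (end-to-end retrieval) runs improve only moderately over the best rerank runs on both tasks (about 4–6% in NDCG@10).
Dense retrieval methods appear among the top submissions, which indicates adoption of neural approaches, although their advantage in the full-ranking setting is not universal.
Query length analysis shows that longer queries tend to be more discriminative, and correlation analysis indicates that evaluation on long queries agrees more closely with results over all queries.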
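The single-stage versus multi-stage distinction in these findings can be pictured with a toy Python sketch. Everything here is hypothetical and chosen only for illustration: the corpus, the term-overlap first stage, and `toy_reranker`, which stands in for the neural model a real run would use. It is not any participant's pipeline.

```python
def overlap_score(query, text):
    """First-stage score: number of distinct query terms found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def first_stage(query, corpus, k):
    """Retrieve the top-k ids from the whole corpus (ties broken by id)."""
    scored = sorted(
        corpus.items(),
        key=lambda item: (-overlap_score(query, item[1]), item[0]),
    )
    return [doc_id for doc_id, _ in scored[:k]]


def toy_reranker(query, text):
    """Stand-in for a neural scorer: fraction of tokens that are query terms."""
    tokens = text.lower().split()
    query_terms = set(query.lower().split())
    return sum(t in query_terms for t in tokens) / len(tokens) if tokens else 0.0


def rerank(query, candidates, corpus, scorer):
    """Second stage: re-order a fixed candidate list with a costlier scorer."""
    return sorted(candidates, key=lambda doc_id: -scorer(query, corpus[doc_id]))


# Hypothetical three-passage corpus.
corpus = {
    "p1": "an overview of neural models and ranking functions",
    "p2": "neural ranking",
    "p3": "pasta recipes",
}
query = "neural ranking"

# Single-stage: the first-stage ordering is the final ranking.
single_stage = first_stage(query, corpus, k=2)

# Multi-stage: the same candidates are re-ordered by the second-stage scorer.
multi_stage = rerank(query, single_stage, corpus, toy_reranker)

print("single-stage:", single_stage)
print("multi-stage: ", multi_stage)
```

The second stage can only re-order what the first stage returned, so a relevant passage missed by the candidate generator cannot be recovered by the reranker. In the rerank subtask the candidate list is supplied to participants, while fullrank runs choose their own first stage.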


This review was generated by AI and checked by human editors.