QUICK REVIEW

[Paper Review] Models and Data for Simple Applications of BERT for Ad Hoc Document Retrieval

Wei Yang | arXiv (Cornell University) | Mar 26, 2019

Topic Modeling | References: 23 | Cited by: 133

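ONE-SENTENCE SUMMARY

This paper presents a simple sentence-level BERT approach to ad hoc document retrieval that ranks longer documents by aggregating sentence scores, and it reports strong results on the Microblog and Robust04 collections.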

ABSTRACT

Following recent successes in applying BERT to question answering, we explore simple applications to ad hoc document retrieval. This required confronting the challenge posed by documents that are typically longer than the length of input BERT was designed to handle. We address this issue by applying inference on sentences individually, and then aggregating sentence scores to produce document scores. Experiments on TREC microblog and newswire test collections show that our approach is simple yet effective, as we report the highest average precision on these datasets by neural approaches that we are aware of.

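MOTIVATION AND OBJECTIVES

Motivate the application of BERT to ad hoc document retrieval when documents are long and sentence-level relevance judgments are unavailable.
Propose a simple inference-and-aggregation technique that avoids complex fine-tuning on document-level labels.
Evaluate the approach on the TREC Microblog Tracks and Robust04 to establish baseline neural performance.
Show that sentence-level inference combined with score aggregation can match or exceed the results of prior neural models.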

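PROPOSED METHOD

Use Anserini for initial retrieval and BERT for sentence-level relevance classification.
Fine-tune BERT on available sentence-level or related data (microblog, QA, WikiQA), and perform binary relevance classification using the [CLS] vector.
For short documents (microblogs), concatenate the query and the document as BERT input, and interpolate the BERT score with the IR score.
For longer documents (newswire), compute BERT scores for the top-scoring sentences and aggregate them with the original document score through a weighted sum with hyperparameters a and w_i.
Tune the interpolation weights and the number of sentences (top n) by cross-validation.
Report AP and P30 as evaluation metrics, and compare against BM25+RM3 and a range of neural baselines.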
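The score combination described in the method can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the interpolated form a * S_doc + (1 - a) * sum(w_i * S_i), and the function names `interpolate_short_doc` and `aggregate_long_doc` are hypothetical. The BERT scores are taken as given inputs, and the values of a and w_i would in practice come from cross-validation.

```python
from typing import Sequence


def interpolate_short_doc(ir_score: float, bert_score: float, a: float) -> float:
    """Interpolate an IR score with a single BERT score (short documents).

    `a` is the interpolation weight on the IR score (assumed to lie in [0, 1]).
    """
    return a * ir_score + (1.0 - a) * bert_score


def aggregate_long_doc(
    doc_score: float,
    sentence_scores: Sequence[float],
    weights: Sequence[float],
    a: float,
) -> float:
    """Combine a document score with its top-scoring sentence scores.

    Assumed form: a * doc_score + (1 - a) * sum(w_i * S_i), where S_i is the
    i-th highest BERT sentence score and len(weights) sets how many sentences
    (top n) are used.
    """
    # Keep only the n highest-scoring sentences, best first.
    top = sorted(sentence_scores, reverse=True)[: len(weights)]
    weighted = sum(w * s for w, s in zip(weights, top))
    return a * doc_score + (1.0 - a) * weighted


if __name__ == "__main__":
    # Hypothetical scores for one newswire document with four sentences.
    print(aggregate_long_doc(10.0, [0.2, 0.9, 0.5, 0.7], [1.0, 0.5], a=0.5))
```

Documents would then be re-sorted by the combined score. Because `zip` stops at the shorter sequence, a document with fewer than n sentences simply contributes fewer terms to the sum.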

EXPERIMENTAL RESULTS

Research Questions

RQ1: Can BERT be effectively applied to ad hoc document retrieval given the length mismatch between documents and BERT input limits?
RQ2: Does sentence-level inference with score aggregation yield competitive or superior performance to traditional neural ranking models on standard datasets?
RQ3: What is the impact of fine-tuning data source (microblog vs QA/WikiQA) on the effectiveness of BERT for retrieval?
RQ4: How does aggregating top-scoring sentences compare to using the full document score for retrieval?

Key Findings

BERT-based scoring with simple sentence-level inference improves over prior neural models on Microblog tracks, achieving substantial gains in AP and P30.
On Robust04, fine-tuning BERT on microblog data outperforms QA-based fine-tuning, suggesting task relevance matters more than document genre.
The best results for Robust04 come from using the top three sentences; adding a fourth sentence does not help under the tuned settings.
BM25+RM3 remains a strong baseline and, in some settings, outperforms neural models, but the proposed BERT-based reranker yields further significant improvements.
Across evaluated datasets, the simple sentence-level aggregation approach yields state-of-the-art results among neural methods reported at the time for these tasks.
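For readers unfamiliar with the two metrics behind these findings, the sketch below shows how AP (average precision) and P30 (precision at rank 30) are conventionally computed for a single query. It is an illustrative implementation with hypothetical function names, not the evaluation tooling used in the paper. Reported figures are these per-query values averaged over all queries.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 30) -> float:
    """Fraction of the top-k ranked documents that are relevant (P30 when k=30)."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k


def average_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Average of the precision values at each rank where a relevant document appears.

    The sum is divided by the total number of relevant documents, so relevant
    documents that were never retrieved lower the score.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


if __name__ == "__main__":
    # Hypothetical ranking and relevance judgments for one query.
    ranking = ["d1", "d2", "d3", "d4"]
    judged_relevant = {"d1", "d3", "d9"}
    print(average_precision(ranking, judged_relevant))
    print(precision_at_k(ranking, judged_relevant, k=2))
```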

This review was generated by AI and reviewed by human editors.