QUICK REVIEW

[Paper Review] Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation

Shengyao Zhuang, Houxing Ren|arXiv (Cornell University)|Jun 21, 2022

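Data Management and Algorithms | Cited by 23

One-Sentence Summary

DSI-QG represents each document in the Differentiable Search Index by a set of generated queries that are re-ranked and filtered by a cross-encoder, which aligns indexing inputs with retrieval inputs and improves both mono-lingual and cross-lingual retrieval.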

ABSTRACT

The Differentiable Search Index (DSI) is an emerging paradigm for information retrieval. Unlike traditional retrieval architectures where index and retrieval are two different and separate components, DSI uses a single transformer model to perform both indexing and retrieval. In this paper, we identify and tackle an important issue of current DSI models: the data distribution mismatch that occurs between the DSI indexing and retrieval processes. Specifically, we argue that, at indexing, current DSI methods learn to build connections between the text of long documents and the identifier of the documents, but then retrieval of document identifiers is based on queries that are commonly much shorter than the indexed documents. This problem is further exacerbated when using DSI for cross-lingual retrieval, where document text and query text are in different languages. To address this fundamental problem of current DSI models, we propose a simple yet effective indexing framework for DSI, called DSI-QG. When indexing, DSI-QG represents documents with a number of potentially relevant queries generated by a query generation model and re-ranked and filtered by a cross-encoder ranker. The presence of these queries at indexing allows the DSI models to connect a document identifier to a set of queries, hence mitigating data distribution mismatches present between the indexing and the retrieval phases. Empirical results on popular mono-lingual and cross-lingual passage retrieval datasets show that DSI-QG significantly outperforms the original DSI model.

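Motivation and Objectives

Identify the data distribution mismatch between DSI indexing (long documents) and retrieval (short queries).
Propose an indexing framework (DSI-QG) that represents documents by generated queries, so that indexing and retrieval inputs are aligned.
Improve cross-lingual retrieval by generating queries across languages.
Show that DSI-QG significantly outperforms the original DSI and other baselines on mono-lingual and cross-lingual datasets.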

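Proposed Method

Use a query generation model to produce a set of potentially relevant queries for each document.
Rank the generated queries with a cross-encoder ranker and keep the top m queries to represent the document at indexing time.
Train the DSI model to associate each document's generated queries with its docid.
Optionally apply cross-lingual query generation, using multilingual T5, to support cross-lingual retrieval.
At indexing time, replace each document with its top-m generated queries so that the input distribution matches the queries seen at retrieval time.
Evaluate with standard IR metrics on a mono-lingual dataset (NQ 320k) and a cross-lingual dataset (XOR QA 100k).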
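
The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `toy_generate_queries` and `toy_relevance_score` are hypothetical stand-ins for the query generation model and the cross-encoder ranker (both are neural models in the paper), and `build_indexing_pairs`, `n_candidates`, and the sample documents are names and data invented for this example. The output is the list of (query, docid) pairs that a DSI model would be trained on in place of (document text, docid) pairs.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a tiny stopword list, used only by the toy scorer below.
STOPWORDS = {"the", "of", "is", "a"}


def toy_generate_queries(doc_text: str, n: int) -> List[str]:
    """Hypothetical stand-in for a query generation model.

    Emits three-word sliding windows over the document as candidate queries.
    """
    words = doc_text.lower().split()
    windows = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    return windows[:n]


def toy_relevance_score(query: str, doc_text: str) -> float:
    """Hypothetical stand-in for a cross-encoder ranker.

    Scores a query by the fraction of its words that are content words
    appearing in the document.
    """
    doc_words = set(doc_text.lower().split())
    q_words = query.lower().split()
    if not q_words:
        return 0.0
    useful = [w for w in q_words if w in doc_words and w not in STOPWORDS]
    return len(useful) / len(q_words)


def build_indexing_pairs(
    docs: Dict[str, str],
    generate: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    n_candidates: int,
    m: int,
) -> List[Tuple[str, str]]:
    """Represent each document by its top-m generated queries."""
    pairs: List[Tuple[str, str]] = []
    for docid, text in docs.items():
        # Step 1: generate candidate queries and drop exact duplicates.
        candidates = list(dict.fromkeys(generate(text, n_candidates)))
        # Step 2: rank candidates by relevance to the document (stable sort).
        ranked = sorted(candidates, key=lambda q: score(q, text), reverse=True)
        # Step 3: keep the top m queries and pair each with the docid.
        pairs.extend((q, docid) for q in ranked[:m])
    return pairs


docs = {
    "d1": "the capital of france is paris",
    "d2": "python is a programming language",
}
pairs = build_indexing_pairs(
    docs, toy_generate_queries, toy_relevance_score, n_candidates=4, m=2
)
for query, docid in pairs:
    print(f"{query!r} -> {docid}")
```

Because the model only ever sees short, query-like inputs during indexing, the inputs at indexing and retrieval time come from a similar distribution, which is the alignment the method aims for.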

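Experimental Results

Research Questions

RQ1: Does replacing documents with generated queries reduce the data distribution shift between indexing and retrieval in DSI?
RQ2: How does DSI-QG perform on mono-lingual retrieval compared with the original DSI and other baselines?
RQ3: Does cross-lingual query generation improve cross-lingual retrieval with DSI-QG?
RQ4: How do the number of generated queries (m) and the cross-encoder ranking step affect performance?
RQ5: What qualitative characteristics do the generated queries show, and how do they affect retrieval?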

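Key Findings

DSI-QG significantly outperforms the original DSI on mono-lingual NQ 320k, with substantial gains in Hits@1 and Hits@10 across model sizes (for example, DSI-QG-base and DSI-QG-large improve markedly over DSI-base and DSI-large).
For mono-lingual retrieval, DSI-QG with T5-base reaches a Hits@1 of 63.49 and a Hits@10 of 82.36, while DSI-base lags clearly behind.
With cross-encoder ranking and top-m query selection, DSI-QG is robust across the languages of XOR QA 100k and achieves the highest Hits@1 in most settings.
Cross-lingual query generation helps narrow the language gap between documents and queries, mitigating the effects of the data distribution mismatch and the language mismatch observed in the original DSI.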
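
For readers unfamiliar with the Hits@k metric reported above, the following is a small sketch of how it is commonly computed: the percentage of queries whose relevant docid appears among the top k predicted docids. It assumes one relevant docid per query; the function name `hits_at_k` and the sample predictions are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def hits_at_k(
    ranked_docids: Dict[str, List[str]],
    relevant: Dict[str, str],
    k: int,
) -> float:
    """Percentage of queries whose relevant docid is in the top-k predictions.

    Assumes exactly one relevant docid per query id.
    """
    if not relevant:
        raise ValueError("at least one query is required")
    hits = sum(
        1
        for qid, gold in relevant.items()
        if gold in ranked_docids.get(qid, [])[:k]
    )
    return 100.0 * hits / len(relevant)


# Invented example: ranked docid predictions for four queries.
predictions = {
    "q1": ["d3", "d1"],
    "q2": ["d2", "d5"],
    "q3": ["d9", "d8"],
    "q4": ["d4"],
}
gold = {"q1": "d1", "q2": "d2", "q3": "d7", "q4": "d4"}

print(hits_at_k(predictions, gold, k=1))  # 50.0
print(hits_at_k(predictions, gold, k=2))  # 75.0
```

With a generative retriever such as DSI, the ranked list for each query would typically come from beam search over docid strings, so Hits@1 reflects the top beam and Hits@10 the ten highest-scoring beams.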

This review was generated by AI and reviewed by a human editor.