QUICK REVIEW

[论文解读] Learning to Tokenize for Generative Retrieval

Weiwei Sun, Lingyong Yan|arXiv (Cornell University)|Apr 9, 2023

Handwritten Text Recognition Techniques被引用 16

一句话总结

GenRet 通过离散自编码学习语义文档标识符（docids），以实现端到端生成检索，在 NQ320K 上达到最新状态并在 MS MARCO 和 BEIR 上取得强劲结果。

ABSTRACT

Conventional document retrieval techniques are mainly based on the index-retrieve paradigm. It is challenging to optimize pipelines based on this paradigm in an end-to-end manner. As an alternative, generative retrieval represents documents as identifiers (docid) and retrieves documents by generating docids, enabling end-to-end modeling of document retrieval tasks. However, it is an open question how one should define the document identifiers. Current approaches to the task of defining document identifiers rely on fixed rule-based docids, such as the title of a document or the result of clustering BERT embeddings, which often fail to capture the complete semantic information of a document. We propose GenRet, a document tokenization learning method to address the challenge of defining document identifiers for generative retrieval. GenRet learns to tokenize documents into short discrete representations (i.e., docids) via a discrete auto-encoding approach. Three components are included in GenRet: (i) a tokenization model that produces docids for documents; (ii) a reconstruction model that learns to reconstruct a document based on a docid; and (iii) a sequence-to-sequence retrieval model that generates relevant document identifiers directly for a designated query. By using an auto-encoding framework, GenRet learns semantic docids in a fully end-to-end manner. We also develop a progressive training scheme to capture the autoregressive nature of docids and to stabilize training. We conduct experiments on the NQ320K, MS MARCO, and BEIR datasets to assess the effectiveness of GenRet. GenRet establishes the new state-of-the-art on the NQ320K dataset. Especially, compared to generative retrieval baselines, GenRet can achieve significant improvements on the unseen documents. GenRet also outperforms comparable baselines on MS MARCO and BEIR, demonstrating the method's generalizability.

研究动机与目标

推动端到端文档检索，超越传统的索引-检索流水线。
解决对于生成检索在固定规则标记器之外定义文档标识符的挑战。
提出一个保留文档语义的离散自编码标记化框架。
实现标记化、重建和检索组件的端到端优化。
展示在包括 NQ320K、MS MARCO 与 BEIR 在内的多样化数据集上的泛化能力。

提出的方法

介绍 GenRet，包含三个组件：将文档映射到 docids 的标记化模型 Q、从 docids 重建原始文档的重建模型 R，以及从查询生成 docids 的检索模型 P。
在标记化和检索中使用基于 T5 的共享架构，具备每个时间步大小为 K 的离散潜在代码本（K=512）。
通过自编码目标进行训练，包含重建损失、承诺损失和检索损失，以使 docids 与文档语义和基于查询的检索对齐。
采用渐进式训练方案，在自回归优化 docids 的同时固定先前前缀以稳定学习。
结合多样的聚类技术（代码本初始化和通过 Sinkhorn-Knopp 的 docid 重新分配）以促进 docid 的多样性和语义空间的平衡划分。
应用受约束的解码以确保生成的 docids 在语料库内有效，并使用束搜索进行检索。

Figure 1. Two types of document retrieval models: (a) Dense retrieval encodes queries and documents to dense vectors and retrieves documents by MIPS ; (b) Generative retrieval tokenizes documents as docids and autoregressively generates docids as retrieval results.

实验结果

研究问题

RQ1学习得到的离散文档标记（docids）是否能比固定规则标记在生成检索中更有效地捕捉语义信息？
RQ2端到端训练标记化、重建和检索是否提升检索性能，特别是在未见文档上？
RQ3如何在不牺牲准确性的前提下，稳定自回归的 docid 生成并推动 docid 分配的多样性？
RQ4GenRet 学习的 docids 是否能推广到超出训练分布的多样化数据集（MS MARCO 与 BEIR）？

主要发现

GenRet 在 NQ320K 上达到新的最新状态，在未见测试数据上取得显著提升（相对于最佳生成基线，R@1 提升约 14%）。
GenRet 在 MS MARCO 和 BEIR 上优于现有的生成检索基线，展示出强泛化能力。
自编码标记化框架有助于生成语义上有意义的 docids，并且能够重建原始文档，从而提高对训练中未见文档的检索。
渐进式训练和多样的 docid 聚类稳定训练并提高 docid 多样性，解决自回归学习和标记分配的挑战。
与基于规则的标记器相比，GenRet 更好地表示和检索未见文档，表现出对分布外数据的鲁棒性。

Figure 2. An overview of the proposed method. The proposed method utilizes a document tokenization model to convert a given document into a sequence of discrete tokens, referred to as a docid. This tokenization process allows for the reconstruction of the original document through a reconstruction m

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。