QUICK REVIEW

[Paper Review] Autoregressive Entity Retrieval

Nicola De Cao, Gautier Izacard|arXiv (Cornell University)|Oct 2, 2020

Topic Modeling | References: 65 | Cited by: 200

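One-Sentence Summary

GENRE retrieves entities by generating their unique names token by token with an autoregressive model, using constrained decoding so that only valid entity identifiers are produced. It achieves strong results on entity disambiguation (ED), entity linking (EL), and document retrieval (DR) with a much smaller memory footprint.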

ABSTRACT

Entities are at the center of how we represent and aggregate knowledge. For instance, Encyclopedias such as Wikipedia are structured by entities (e.g., one per Wikipedia article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. Current approaches can be understood as classifiers among atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity meta information such as their descriptions. This approach has several shortcomings: (i) context and entity affinity is mainly captured through a vector dot product, potentially missing fine-grained interactions; (ii) a large memory footprint is needed to store dense representations when considering large entity sets; (iii) an appropriately hard set of negative data has to be subsampled at training time. In this work, we propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion. This mitigates the aforementioned technical issues since: (i) the autoregressive formulation directly captures relations between context and entity name, effectively cross encoding both; (ii) the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with vocabulary size, not entity count; (iii) the softmax loss is computed without subsampling negative data. We experiment with more than 20 datasets on entity disambiguation, end-to-end entity linking and document retrieval tasks, achieving new state-of-the-art or very competitive results while using a tiny fraction of the memory footprint of competing systems. Finally, we demonstrate that new entities can be added by simply specifying their names. Code and pre-trained models at https://github.com/facebookresearch/GENRE.

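Motivation and Goals

Enable entity retrieval that is more flexible than classification over atomic labels by exploiting structured, compositional entity names.
Propose GENRE, an autoregressive sequence-to-sequence framework that generates entity names conditioned on the input context.
Introduce constrained decoding so that only valid entity identifiers from a predefined candidate set are generated.
Show that GENRE performs strongly on ED, EL, and document retrieval while using far less memory.
Show that new entities can be added simply by appending their unambiguous names to the candidate set.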

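Proposed Method

Fine-tune a pre-trained Transformer-based seq2seq model (for example BART, pre-trained with a language-modeling objective) to generate entity names.
Represent each entity e by its textual name y = (y_1, ..., y_N) and score it against the input x with the autoregressive product score(e|x) = pθ(y|x) = ∏_{i=1}^{N} pθ(y_i | y_{<i}, x), where N is the number of tokens in the name.
Train with the standard seq2seq objective (maximum likelihood with teacher forcing), without negative sampling.
At inference, run constrained beam search over a trie of valid entity names so that only entities in the candidate set can be output.
Constrained decoding guarantees that every generated output is a valid entity identifier. Because the loss is a softmax over the token vocabulary rather than over all entities, it can be computed exactly and efficiently.
Extend autoregressive decoding to end-to-end entity linking by generating a marked-up version of the input, with the entity-name trie constraints applied dynamically during decoding.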
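The scoring rule and the trie-constrained beam search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GENRE implementation: names are tokenized by whitespace rather than by BART's subword vocabulary, `toy_log_probs` is a hypothetical stand-in for the decoder's next-token distribution, and `END`, `build_trie`, `allowed_next`, and `constrained_beam_search` are names invented for this example.

```python
import math
from typing import Callable, Dict, List, Tuple

END = "</s>"  # hypothetical end-of-name token

def build_trie(names: List[List[str]]) -> dict:
    """Nested-dict prefix tree over tokenized entity names."""
    root: dict = {}
    for tokens in names:
        node = root
        for tok in tokens + [END]:
            node = node.setdefault(tok, {})
    return root

def allowed_next(trie: dict, prefix: List[str]) -> List[str]:
    """Tokens that keep `prefix` inside the trie (empty if the prefix is invalid)."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return list(node.keys())

LogProbFn = Callable[[str, List[str]], Dict[str, float]]

def constrained_beam_search(x: str, trie: dict, log_prob_fn: LogProbFn,
                            beam_size: int = 2, max_len: int = 16
                            ) -> List[Tuple[str, float]]:
    """Return up to `beam_size` complete names with their summed log-probabilities."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    finished: List[Tuple[List[str], float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = log_prob_fn(x, prefix)
            # Only expand with tokens the trie allows after this prefix.
            for tok in allowed_next(trie, prefix):
                cand = (prefix + [tok], score + log_probs.get(tok, -math.inf))
                (finished if tok == END else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda c: c[1], reverse=True)
    return [(" ".join(toks[:-1]), s) for toks, s in finished[:beam_size]]

def toy_log_probs(x: str, prefix: List[str]) -> Dict[str, float]:
    """Toy stand-in for the decoder: boosts tokens that appear in the input."""
    vocab = ["Leonardo", "da", "Vinci", "DiCaprio", "Paris", END]
    weights = {t: (8.0 if t in x.split() else 1.0) for t in vocab}
    total = sum(weights.values())
    return {t: math.log(w / total) for t, w in weights.items()}

names = ["Leonardo da Vinci", "Leonardo DiCaprio", "Paris"]
trie = build_trie([n.split() for n in names])
results = constrained_beam_search("Titanic stars Leonardo DiCaprio", trie, toy_log_probs)
print(results)
```

Each hypothesis score is the sum of per-token log-probabilities, that is, the log of the autoregressive product, and every returned string is by construction one of the names stored in the trie. With a real model, the same constraint can usually be expressed as a callback that maps the generated prefix to the allowed next token ids.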

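Experimental Results

Research Questions

RQ1: Can an autoregressive model generate entity names conditioned on the input context and thereby perform ED, EL, and document retrieval effectively?
RQ2: Does constraining decoding to a candidate set with a trie keep accuracy intact while making large-scale decoding efficient?
RQ3: How does GENRE compare with existing bi-encoder and classifier-based retrievers in accuracy and memory footprint on ED, EL, and DR?
RQ4: Can new entities be added just by adding their names to the candidate set, without retraining?
RQ5: How much does the training data (for example pre-training on BLINK data, fine-tuning on in-domain datasets) affect ED, EL, and DR performance?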

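Key Findings

GENRE achieves state-of-the-art or competitive results on more than 20 datasets across three task families (ED, EL, and page-level DR).
GENRE indexes entity names instead of storing a dense vector per entity, which cuts the memory footprint substantially (about 20 times smaller on average).
Constrained beam search over the trie guarantees that outputs are valid entity names. Separately, because the loss is a softmax over vocabulary tokens, it is computed exactly without negative sampling.
A structured, compositional entity-name space helps generalization, especially when the mention and the entity name do not match exactly.
New entities can be added simply by appending their unambiguous names to the candidate set, without retraining.
On DR (the KILT benchmark), GENRE improves over strong baselines by as much as 13.7 R-precision points on average and is best or near best on every dataset except Natural Questions.
In ED, GENRE's gains are modest on in-domain data and larger in out-of-domain settings, indicating robustness across domains.
In EL, GENRE performs best in-domain on AIDA and shows notable improvements on several out-of-domain datasets (for example Derczynski and KORE50).
Ablations show that constrained decoding and the use of a candidate set improve performance markedly over unconstrained or candidate-free variants.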
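The finding that new entities can be added by appending their names can be illustrated with a small sketch. It assumes the candidate set is stored as a nested-dict trie over whitespace tokens (a simplification; a real system would use the model's subword tokenizer), and `add_entity`, `is_valid_name`, and `END` are hypothetical names introduced only for this example. No model parameters appear anywhere in the update, which is the point being illustrated.

```python
from typing import List

END = "</s>"  # hypothetical end-of-name token

def add_entity(trie: dict, name_tokens: List[str]) -> None:
    """Insert one tokenized entity name into the candidate trie, in place."""
    node = trie
    for tok in name_tokens + [END]:
        node = node.setdefault(tok, {})

def is_valid_name(trie: dict, name_tokens: List[str]) -> bool:
    """True if the full name (including END) is a path in the trie."""
    node = trie
    for tok in name_tokens + [END]:
        if tok not in node:
            return False
        node = node[tok]
    return True

# Initial candidate set.
trie: dict = {}
for name in ["Leonardo da Vinci", "Leonardo DiCaprio"]:
    add_entity(trie, name.split())

# A new entity appears after training: only the trie changes.
new_name = "Leonardo Bonucci"
before = is_valid_name(trie, new_name.split())
add_entity(trie, new_name.split())
after = is_valid_name(trie, new_name.split())

print(before, after)              # False True
print(sorted(trie["Leonardo"]))   # ['Bonucci', 'DiCaprio', 'da']
```

After the update, a constrained decoder that consults this trie may continue "Leonardo" with "Bonucci" as well, so the new entity becomes retrievable without touching the model weights.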

This review was generated by AI and checked by a human editor.