QUICK REVIEW

[Paper Review] CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization

Andre Esteva, Anuprit Kale|arXiv (Cornell University)|Jun 17, 2020

Topic Modeling | References: 26 | Cited by: 32

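One-Sentence Summary

CO-Search is a retriever-ranker semantic search engine for the COVID-19 literature. It combines SBERT embeddings with TF-IDF and BM25, and uses a multi-hop question-answering module and abstractive summarization to rank documents and present answers.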

ABSTRACT

The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a significant corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. As of May 2020, 128,000 coronavirus-related publications have been collected through the COVID-19 Open Research Dataset Challenge. Here we present CO-Search, a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers during a time of crisis. The retriever is built from a Siamese-BERT encoder that is linearly composed with a TF-IDF vectorizer, and reciprocal-rank fused with a BM25 vectorizer. The ranker is composed of a multi-hop question-answering module, that together with a multi-paragraph abstractive summarizer adjust retriever scores. To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations, creating 1.3 million (citation title, paragraph) tuples for training the encoder. We evaluate our system on the data of the TREC-COVID information retrieval challenge. CO-Search obtains top performance on the datasets of the first and second rounds, across several key metrics: normalized discounted cumulative gain, precision, mean average precision, and binary preference.
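The tuple-generation step described in the abstract can be pictured with a minimal sketch. It assumes the bipartite graph is given as a list of (paragraph id, citation id) edges; the names `paragraphs`, `citation_titles`, `edges`, and `build_training_tuples` are hypothetical and do not come from the paper or its code.

```python
from typing import Dict, List, Tuple


def build_training_tuples(
    paragraphs: Dict[str, str],
    citation_titles: Dict[str, str],
    edges: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Turn each edge of a paragraph-citation bipartite graph into one
    (citation title, paragraph) training tuple.

    Assumption: an edge (paragraph_id, citation_id) means the paragraph
    cites the work with that title. Edges with unknown ids are skipped.
    """
    tuples = []
    for paragraph_id, citation_id in edges:
        if paragraph_id in paragraphs and citation_id in citation_titles:
            tuples.append((citation_titles[citation_id], paragraphs[paragraph_id]))
    return tuples
```

A paragraph that cites two works yields two tuples, which is one way a limited corpus could produce many training pairs.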

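Motivation and Objectives

Provide an effective retrieval system for the rapidly growing COVID-19 literature corpus (CORD-19).
Combine semantic and keyword-based retrieval signals for robust document ranking.
Improve ranking with multi-hop question-answering output and abstractive summaries, so that results better answer the query.
Use a paragraph-citation bipartite graph to train domain-aware embeddings that improve semantic retrieval.
Evaluate performance on the TREC-COVID benchmark and release open-source code.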

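Proposed Method

Build a bipartite graph of paragraphs and citations to generate 1.3 million (paragraph, citation title) training tuples for SBERT.
Embed queries and documents with SBERT to enable semantic nearest-neighbor retrieval.
Linearly combine SBERT paragraph scores with TF-IDF document scores, then fuse the result with BM25 via reciprocal rank fusion.
Use a multi-hop question-answering model to extract answer spans and modulate the ranking based on the QA output.
Train an abstractive summarizer (a BERT encoder with a modified GPT-2 decoder) to generate a single cross-attention-based summary that is used in ranking.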
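The retrieval fusion step can be illustrated with a minimal sketch. It assumes per-document scores are already available as dictionaries; the mixing weight `mu`, the constant `k = 60` (a value commonly used for reciprocal rank fusion), and all function names are illustrative assumptions, not values or identifiers taken from the paper.

```python
from typing import Dict, List


def linear_combine(
    sbert: Dict[str, float], tfidf: Dict[str, float], mu: float = 0.5
) -> Dict[str, float]:
    """Weighted sum of two score sets; a missing score counts as 0.0.
    `mu` is a hypothetical mixing weight."""
    docs = set(sbert) | set(tfidf)
    return {d: mu * sbert.get(d, 0.0) + (1.0 - mu) * tfidf.get(d, 0.0) for d in docs}


def to_ranking(scores: Dict[str, float]) -> List[str]:
    """Sort document ids by descending score, breaking ties by id."""
    return sorted(scores, key=lambda d: (-scores[d], d))


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Reciprocal rank fusion: each ranking contributes 1 / (k + rank),
    with ranks starting at 1."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused
```

In use, the combined SBERT and TF-IDF ranking and the BM25 ranking would both be passed to `reciprocal_rank_fusion`, and the fused scores sorted with `to_ranking`. Because the fusion depends only on ranks, BM25 scores do not need to be normalized to the same scale as the other two signals.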

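Experimental Results

Research Questions

RQ1: Does a retriever-ranker model that fuses semantic, TF-IDF, and BM25 signals improve COVID-19 literature retrieval?
RQ2: Does adding multi-hop question answering and abstractive summarization improve the relevance and usefulness of retrieved documents?
RQ3: How does SBERT training based on the paragraph-citation bipartite graph affect semantic retrieval on small-to-medium domain datasets?
RQ4: How do QA-driven and summarization-driven modulation affect final ranking performance?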

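Key Findings

In Round 1, CO-Search achieves the best performance among automated systems on several metrics (nDCG@10, P@5, P@10, MAP, Bpref).
In Round 2, CO-Search ranks first among automated systems on the same metrics, and also ranks near the top in each round when compared against all systems, including non-automated ones.
When evaluated on all topic-document pairs (annotated and unannotated), CO-Search ranks in the top 21 in Round 1 and in the top 3 in Round 2.
The system is automated and open source, and is intended to support research and practical retrieval needs during the COVID-19 crisis.
The architecture combines semantic paragraph embeddings with keyword-based document representations and applies QA-guided and summarization-guided re-ranking.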
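For readers unfamiliar with two of the metrics named above, the following sketch shows one common formulation of P@k and nDCG@k. It assumes graded relevance judgments in a dictionary `qrels`, treats unjudged documents as non-relevant, and uses a linear gain with a log2 discount. The official TREC-COVID evaluation tooling may differ in details, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def precision_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """Fraction of the top-k documents with a relevance grade above 0."""
    return sum(1 for d in ranking[:k] if qrels.get(d, 0) > 0) / k


def dcg_at_k(gains: List[int], k: int) -> float:
    """Discounted cumulative gain: gain at rank i (1-based) is divided
    by log2(i + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ordering of the
    judged documents; returns 0.0 when no document is relevant."""
    gains = [qrels.get(d, 0) for d in ranking]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Under this formulation, a ranking that places documents in descending order of relevance grade scores an nDCG of 1.0.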

This review was generated by AI and reviewed by a human editor.