QUICK REVIEW

[Paper Review] Abstractive and Extractive Text Summarization using Document Context Vector and Recurrent Neural Networks

Chandra Khatri, Gyanit Singh|arXiv (Cornell University)|Jul 20, 2018

Topic Modeling | References: 27 | Cited by: 34

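One-Sentence Summary

This paper proposes a document context vector approach that strengthens abstractive and extractive text summarization with RNN-based sequence-to-sequence models. By injecting contextual information derived from user behavior and seller data at the first time step of the encoder, the model generates summaries that are more document-specific and closer to human preferences. It reports state-of-the-art performance on eBay product descriptions, particularly when trained on large-scale, weakly supervised approximate summaries.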

ABSTRACT

Sequence to sequence (Seq2Seq) learning has recently been used for abstractive and extractive summarization. In current study, Seq2Seq models have been used for eBay product description summarization. We propose a novel Document-Context based Seq2Seq models using RNNs for abstractive and extractive summarizations. Intuitively, this is similar to humans reading the title, abstract or any other contextual information before reading the document. This gives humans a high-level idea of what the document is about. We use this idea and propose that Seq2Seq models should be started with contextual information at the first time-step of the input to obtain better summaries. In this manner, the output summaries are more document centric, than being generic, overcoming one of the major hurdles of using generative models. We generate document-context from user-behavior and seller provided information. We train and evaluate our models on human-extracted-golden-summaries. The document-contextual Seq2Seq models outperform standard Seq2Seq models. Moreover, generating human extracted summaries is prohibitively expensive to scale, we therefore propose a semi-supervised technique for extracting approximate summaries and using it for training Seq2Seq models at scale. Semi-supervised models are evaluated against human extracted summaries and are found to be of similar efficacy. We provide side by side comparison for abstractive and extractive summarizers (contextual and non-contextual) on same evaluation dataset. Overall, we provide methodologies to use and evaluate the proposed techniques for large document summarization. Furthermore, we found these techniques to be highly effective, which is not the case with existing techniques.

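Research Motivation and Objectives

Improve the quality of abstractive and extractive summaries by incorporating document context into RNN-based sequence-to-sequence models.
Address the scalability cost of human-annotated summaries by proposing a weakly supervised method that generates approximate summaries automatically at scale.
Show that context-aware RNNs outperform non-contextual baselines on both extractive and abstractive summarization.
Verify that large-scale weakly supervised training data improves performance when evaluated against gold-standard human annotations.
Show that summaries from context-aware RNNs are more document-centric and closer to human preferences than those from generic generative models.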

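Proposed Method

The model uses a document context vector, extracted from user behavior and seller-provided metadata, to initialize the hidden state of the encoder RNN at its first time step.
The document context vector is learned through a joint representation model that captures semantic and topical signals in the auxiliary information.
For weakly supervised training, approximate summaries are extracted automatically using sentence likelihood scores from a pre-trained RNN model.
The extractive model adopts a re-ranking strategy inspired by automatic speech recognition: candidate sentences are ranked by their likelihood scores under an RNN language model.
Both the abstractive and extractive models are trained with attention-based sequence-to-sequence learning, with the context vector injected at the first time step of the encoder.
Evaluation is performed on a held-out set of human-annotated summaries, using ROUGE, BLEU, NDCG, and MAP.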
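To make the first-time-step injection concrete, here is a minimal sketch, not the authors' implementation. It assumes a plain Elman RNN encoder without attention, and a context vector that has the same dimensionality as the token embeddings so it can be consumed as the input at step 0. The names `rnn_step` and `encode`, and the weight layout, are hypothetical and chosen only for illustration.

```python
import math


def rnn_step(x, h, W_xh, W_hh, b):
    """One Elman RNN step: h' = tanh(W_xh @ x + W_hh @ h + b)."""
    return [
        math.tanh(
            sum(W_xh[i][j] * x[j] for j in range(len(x)))
            + sum(W_hh[i][k] * h[k] for k in range(len(h)))
            + b[i]
        )
        for i in range(len(h))
    ]


def encode(token_vectors, context_vector, W_xh, W_hh, b):
    """Run the encoder with the context vector as the input at time step 0.

    Assumption: the context vector lives in the same space as the token
    embeddings, so it can simply be prepended to the input sequence.
    """
    hidden = [0.0] * len(b)
    # The context is read first, so every later hidden state depends on it.
    for x in [context_vector] + list(token_vectors):
        hidden = rnn_step(x, hidden, W_xh, W_hh, b)
    return hidden
```

Calling `encode` on the same token sequence with two different context vectors yields two different final hidden states, which is the mechanism by which a decoder conditioned on that state can produce document-specific rather than generic summaries.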

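Experimental Results

Research Questions

RQ1: Does injecting document context at the first time step of the encoder improve the quality and relevance of generated summaries?
RQ2: Does training on large-scale weakly supervised approximate summaries outperform training on a smaller human-annotated dataset?
RQ3: How do context-aware RNNs compare with non-contextual RNNs on extractive and abstractive summarization?
RQ4: Can an abstractive RNN be adapted effectively to extractive summarization and outperform dedicated extractive models?
RQ5: Are the gains from large-scale weakly supervised data large enough to offset the noise in approximate summaries?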

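Key Findings

The extractive contextual RNN (EC-RNN) reaches 99.41% accuracy and a 99.54% F1 score on a 5K human-evaluated test set, outperforming the non-contextual model.
The abstractive contextual RNN (AC-RNN) scores 0.26 ROUGE-L F1 and 0.021 BLEU under weakly supervised training, showing strong performance on large-scale data.
EC-RNN scores 0.655 on NDCG@1 and 0.167 on MAP@3, indicating better ranking quality for extractive summarization.
Weakly supervised models trained on 100,000 algorithmically labeled approximate summaries match or exceed supervised models trained on 5,000 human-annotated summaries.
Contextual RNNs significantly outperform non-contextual RNNs on all metrics, supporting the value of document context for summary relevance.
The abstractive model with context (AC-RNN) reaches a ROUGE-L F1 score of 0.23 on large-scale data, indicating that performance improves with larger data scale and context injection.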
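For readers less familiar with the metrics quoted above, the following is a simplified sketch of ROUGE-L F1 and NDCG@k. It is not the evaluation code used in the paper: it assumes whitespace tokenization with no stemming, a balanced F1 rather than a weighted F-measure, and linear gain for NDCG. The function names are illustrative.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced ROUGE-L F1 over whitespace tokens (simplifying assumption)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def ndcg_at_k(relevances, k):
    """NDCG@k for relevance labels listed in the order the model ranked them."""
    def dcg(rels):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

With binary relevance, `ndcg_at_k(relevances, 1)` is 1.0 when the top-ranked sentence is relevant and 0.0 otherwise, so an NDCG@1 averaged over a test set can be read as the fraction of documents whose first-ranked sentence is relevant.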

This review was generated by AI and reviewed by a human editor.