QUICK REVIEW

[Paper Review] Hierarchical Learning for Generation with Long Source Sequences

Tobias Rohde, Xiaoxia Wu | arXiv (Cornell University) | Apr 15, 2021

Topic Modeling | References: 57 | Cited by: 41

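One-Sentence Summary

The paper introduces HAT (Hierarchical Attention Transformer), a seq2seq model with hierarchical attention that handles long source sequences in generation tasks, achieving state-of-the-art ROUGE on several summarization datasets and gains on document-level translation. It also analyzes the hierarchical attention and explores encoder-only pre-training.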

ABSTRACT

One of the challenges for current sequence to sequence (seq2seq) models is processing long sequences, such as those in summarization and document level machine translation tasks. These tasks require the model to reason at the token level as well as the sentence and paragraph level. We design and study a new Hierarchical Attention Transformer-based architecture (HAT) that outperforms standard Transformers on several sequence to sequence tasks. Furthermore, our model achieves state-of-the-art ROUGE scores on four summarization tasks, including PubMed, arXiv, CNN/DM, SAMSum, and AMI. Our model outperforms document-level machine translation baseline on the WMT20 English to German translation task. We investigate what the hierarchical layers learn by visualizing the hierarchical encoder-decoder attention. Finally, we study hierarchical learning on encoder-only pre-training and analyze its performance on classification tasks.

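Motivation and Objectives

Address the challenge of processing long source sequences in seq2seq tasks (summarization and document-level translation).
Propose a Hierarchical Attention Transformer (HAT) that adds sentence-level representations through hierarchical encoder layers.
Demonstrate state-of-the-art performance on long-sequence summarization benchmarks and gains over the baseline on document-level machine translation.
Analyze what the hierarchical attention learns, and explore encoder-side hierarchical pre-training for classification tasks.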

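Proposed Method

Extend the Transformer with hierarchical encoder layers that attend over sentence-level BOS tokens to build sentence representations.
Insert a BOS token at the start of every sentence during preprocessing to enable sentence-level hierarchical attention.
On the decoder side, add attention over both the token-level encoder outputs and the BOS-based sentence representations.
Initialize the non-hierarchical parts from pre-trained BART weights and the hierarchical components randomly, then fine-tune on long-sequence generation tasks.
Evaluate on long-sequence summarization (PubMed, arXiv, CNN/DM, XSum, SAMSum, AMI, ICSI) and document-level MT (WMT20 En-De, En-Cs, TED17 Zh-En).
Run encoder-only hierarchical pre-training and evaluate on SQuAD 2.0, MNLI-m, and RACE.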
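
The first two steps of the method can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the `<s>` marker, the whitespace tokenization, and the function names `insert_bos` and `bos_self_attention` are assumptions made for illustration. The attention shown is single-head and has no learned projections, feed-forward sublayer, or residual connections, all of which a real Transformer layer would include.

```python
import math

BOS = "<s>"  # hypothetical sentence-start marker; the real token depends on the tokenizer


def insert_bos(sentences):
    """Prefix every sentence with BOS; return the flat token list and the BOS positions."""
    tokens, bos_positions = [], []
    for sentence in sentences:
        bos_positions.append(len(tokens))
        tokens.append(BOS)
        tokens.extend(sentence.split())  # whitespace split stands in for real tokenization
    return tokens, bos_positions


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bos_self_attention(hidden, bos_positions):
    """Scaled dot-product self-attention restricted to the BOS positions.

    hidden: one vector (list of floats) per token, standing in for the
    token-level encoder output. Returns one contextualized vector per sentence.
    """
    bos_vectors = [hidden[i] for i in bos_positions]
    dim = len(bos_vectors[0])
    outputs = []
    for query in bos_vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in bos_vectors]
        weights = softmax(scores)
        outputs.append([sum(w * vec[d] for w, vec in zip(weights, bos_vectors))
                        for d in range(dim)])
    return outputs


tokens, bos_positions = insert_bos(["the cat sat", "it purred"])
# tokens == ['<s>', 'the', 'cat', 'sat', '<s>', 'it', 'purred']; bos_positions == [0, 4]
```

Restricting attention to the BOS positions means the layer's cost grows with the number of sentences rather than the number of tokens, and its outputs give the decoder a short sentence-level sequence to attend over alongside the token-level encoder outputs.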

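Experimental Results

Research Questions

RQ1: Does hierarchical attention improve generation when the source sequence is long (documents or multi-sentence inputs)?
RQ2: How does the hierarchical encoder affect decoder-side attention and generation quality?
RQ3: Does encoder-only hierarchical pre-training help classification tasks with long inputs?
RQ4: What insights can be gained by visualizing the hierarchical encoder-decoder attention patterns?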

Key Findings

Dataset / Model	R1	R2	RL	Dataset / Model	R1	R2	RL
PubMed	45.97	20.15	41.34	XSum	47.60	24.83	39.64
arXiv	46.32	20.65	42.33	CNN/DM	46.54	18.82	42.00
Transformer-BART	48.35	21.43	36.90	HAT-BART	46.68	19.07	42.17
HAT-BART	48.36	21.43	37.00	-	-	-	-
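
For readers unfamiliar with the R1 and R2 columns, ROUGE-N measures n-gram overlap between a generated summary and a reference. The sketch below is a simplified illustration. Published scores come from standard ROUGE toolkits, which typically add stemming and other normalization. RL is based on the longest common subsequence and is not covered here. The function names are chosen for this example only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n):
    """F1 over clipped n-gram overlap between two whitespace-tokenized strings."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # Counter intersection clips repeated n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
r1 = rouge_n_f1(candidate, reference, 1)  # 5 of 6 unigrams match, F1 is about 0.833
r2 = rouge_n_f1(candidate, reference, 2)  # 3 of 5 bigrams match, F1 is about 0.6
```

The table reports these values multiplied by 100.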

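The hierarchical model achieves state-of-the-art ROUGE on the PubMed and arXiv summarization datasets.
HAT-BART outperforms the plain seq2seq baseline on CNN/DailyMail and XSum summarization.
On SAMSum and AMI/ICSI, HAT variants achieve ROUGE scores that are competitive with or higher than the baselines.
On document-level translation (WMT20 En-De), the hierarchical model outperforms the plain model; gains on En-Cs and Zh-En are less pronounced.
Encoder-only hierarchical pre-training brings faster convergence and improvements on RACE, with mixed results on SQuAD 2.0 and MNLI-m.
Hierarchical attention shows varied, layer-specific focus on the sentence-level BOS embeddings, suggesting that useful sentence-level representations exist at different depths.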

This review was generated by AI and reviewed by a human editor.