QUICK REVIEW

[Paper Review] Variational Transformers for Diverse Response Generation

Zhaojiang Lin, Genta Indra Winata | arXiv (Cornell University) | Mar 28, 2020

Speech Recognition and Synthesis | References: 27 | Cited by: 46

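One-Sentence Summary

This paper proposes the Variational Transformer (VT) in two variants, the Global Variational Transformer (GVT) and the Sequential Variational Transformer (SVT), which couple the efficiency of the Transformer with CVAE-style latent variables to produce diverse yet coherent dialogue responses, outperforming baselines on both automatic metrics and human evaluation.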

ABSTRACT

Despite the great promise of Transformers in many sequence modeling tasks (e.g., machine translation), their deterministic nature hinders them from generalizing to high entropy tasks such as dialogue response generation. Previous work proposes to capture the variability of dialogue responses with a recurrent neural network (RNN)-based conditional variational autoencoder (CVAE). However, the autoregressive computation of the RNN limits the training efficiency. Therefore, we propose the Variational Transformer (VT), a variational self-attentive feed-forward sequence model. The VT combines the parallelizability and global receptive field of the Transformer with the variational nature of the CVAE by incorporating stochastic latent variables into Transformers. We explore two types of the VT: 1) modeling the discourse-level diversity with a global latent variable; and 2) augmenting the Transformer decoder with a sequence of fine-grained latent variables. Then, the proposed models are evaluated on three conversational datasets with both automatic metric and human evaluation. The experimental results show that our models improve standard Transformers and other baselines in terms of diversity, semantic relevance, and human judgment.

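Research Motivation and Goals

Address the blandness and genericness of Transformer-based dialogue generation.
Introduce stochastic latent variables into the Transformer to capture diverse, context-relevant responses.
Compare global (discourse-level) and sequential latent-variable designs for dialogue modeling.
Evaluate on multiple conversational datasets using automatic metrics and human judgment.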

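Proposed Method

Two VT variants are proposed: the Global Variational Transformer (GVT), which adds a global latent variable to the decoder input, and the Sequential Variational Transformer (SVT), which introduces a sequence of latent variables, one at each decoding position.
Latent variables are modeled with CVAE-inspired prior and posterior distributions inside the Transformer framework; the SVT uses non-causal attention to compute its latent variables.
KL annealing and a bag-of-words (BOW) auxiliary loss are introduced to mitigate latent-variable vanishing and to encourage informative latent representations.
Training is based on the ELBO objective combined with the SBOW auxiliary loss, which encourages the latent variable at each position to plan for the tokens still to be generated.
The base model is a 4-layer Transformer with hidden size 300, 4 attention heads, and latent dimension 300; models reuse MLE pretraining and are optimized with Adam.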
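The loss terms named in this list can be illustrated with a minimal, framework-free sketch. It assumes diagonal Gaussian prior and posterior distributions and a linear annealing schedule; the function names, the `anneal_steps` default, and the unweighted sum of terms are hypothetical choices for illustration and are not taken from the paper.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def kl_weight(step, anneal_steps):
    """Linear KL annealing from 0 to 1 (the schedule shape is an assumption)."""
    return min(1.0, step / anneal_steps)


def bow_loss(log_probs, target_ids):
    """Bag-of-words loss: negative log-likelihood of every target token under
    one vocabulary distribution predicted from the latent, ignoring word order."""
    return -sum(log_probs[t] for t in target_ids)


def total_loss(recon_nll, kl, bow, step, anneal_steps=1000):
    """Negative ELBO with an annealed KL term, plus the auxiliary BOW term."""
    return recon_nll + kl_weight(step, anneal_steps) * kl + bow


# Usage: posterior q and prior p over a 2-dimensional latent.
rng = random.Random(0)
z = sample_latent([0.5, -0.5], [0.0, 0.0], rng)
kl = gaussian_kl([0.5, -0.5], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
loss = total_loss(recon_nll=2.0, kl=kl, bow=1.0, step=500)
```

Early in training the small KL weight lets the decoder learn to use the latent before the KL penalty pulls the posterior toward the prior. In a sequential variant, the same KL and BOW terms would be computed per decoding position, with each position's BOW target restricted to the tokens that follow it.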

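Experimental Results

Research Questions

RQ1: Can integrating latent variables into a Transformer-based dialogue model increase response diversity without sacrificing semantic relevance?
RQ2: How do global (discourse-level) and sequential (per-token) latent variables affect generation quality and human judgment?
RQ3: Do KL annealing and the auxiliary losses stabilize training and preserve useful latent information in VT models?
RQ4: How do GVT and SVT compare on automatic metrics and human evaluation across datasets?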

Main Findings

Model	PPL	KLD	Dist-1	Dist-2	Dist-3	EMB_FT	EMB_BERT	Coherence	Engagedness
Seq2Seq	130.75	-	0.0055	0.0187	0.0347	0.738	0.594	20.67	20.67
CVAE	35.33	27.55	0.0189	0.1340	0.3640	0.751	0.613	18.33	18
Transformer	72.66	-	0.0040	0.0161	0.0324	0.741	0.596	19.67	23.33
GVT	19.71	18.15	0.0207	0.1524	0.4064	0.753	0.609	23	22.67
SVT	18.96	32.27	0.0079	0.1053	0.3654	0.762	0.619	26	27.67
Human	-	-	-	-	-	-	-	-	-
CVAE	31.32	10.01	0.0186	0.1102	0.295	0.917	0.666	20.67	21.33
Transformer	48.03	-	0.0058	0.0237	0.0524	0.915	0.672	24.67	24.67
GVT	18.34	19.13	0.0204	0.1406	0.3995	0.917	0.675	20	21.33
SVT	17.75	24.67	0.0213	0.1521	0.3936	0.906	0.665	38.67	36.67

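GVT and SVT generally outperform the standard Transformer and CVAE baselines on diversity and human judgment.
On MojiTalk, SVT achieves higher semantic relevance as measured by embedding similarity (EMB_FT and EMB_BERT); on Persona+ED the picture is more mixed.
GVT generally lowers reconstruction perplexity (PPL), suggesting more informative latent variables; SVT lowers PPL further with its sequential latent variables.
GVT and SVT score higher than the deterministic baselines (Seq2Seq and Transformer) on Dist-1/Dist-2/Dist-3, indicating more diverse outputs.
Human evaluation favors SVT on coherence and engagedness, and SVT's per-token latent modeling improves informativeness on some datasets.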
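For readers unfamiliar with the Dist-n diversity metric cited in these findings, the sketch below shows one common way to compute it: the number of unique n-grams divided by the total number of n-grams across all generated responses. Whitespace-level tokenization and corpus-level pooling are assumptions here; the paper's exact evaluation script may differ.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # Return 0.0 when no n-grams exist (empty input or responses shorter than n).
    return len(unique) / total if total else 0.0


# Usage: two near-identical responses share most of their n-grams.
responses = [
    "i do not know".split(),
    "i do not care".split(),
]
dist1 = distinct_n(responses, 1)  # 5 unique unigrams out of 8
dist2 = distinct_n(responses, 2)  # 4 unique bigrams out of 6
```

A model that keeps producing the same generic reply drives this ratio toward zero, which is why low Dist-n values signal bland output.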


This review was generated by AI and reviewed by a human editor.