QUICK REVIEW

[Paper Review] Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model

Jiasen Lu, Anitha Kannan|arXiv (Cornell University)|Jun 5, 2017

Multimodal Machine Learning Applications | References: 47 | Cited by: 86

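One-Sentence Summary

This paper proposes a training framework that transfers knowledge from a discriminative visual dialog model to a generative one via Gumbel-Softmax, so that the generator produces more diverse and informative answers and achieves better performance on VisDial.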

ABSTRACT

We present a novel training framework for neural sequence models, particularly for grounded dialog generation. The standard training paradigm for these models is maximum likelihood estimation (MLE), or minimizing the cross-entropy of the human responses. Across a variety of domains, a recurring problem with MLE trained generative neural dialog models (G) is that they tend to produce 'safe' and generic responses ("I don't know", "I can't tell"). In contrast, discriminative dialog models (D) that are trained to rank a list of candidate human responses outperform their generative counterparts; in terms of automatic metrics, diversity, and informativeness of the responses. However, D is not useful in practice since it cannot be deployed to have real conversations with users. Our work aims to achieve the best of both worlds -- the practical usefulness of G and the strong performance of D -- via knowledge transfer from D to G. Our primary contribution is an end-to-end trainable generative visual dialog model, where G receives gradients from D as a perceptual (not adversarial) loss of the sequence sampled from G. We leverage the recently proposed Gumbel-Softmax (GS) approximation to the discrete distribution -- specifically, an RNN augmented with a sequence of GS samplers, coupled with the straight-through gradient estimator to enable end-to-end differentiability. We also introduce a stronger encoder for visual dialog, and employ a self-attention mechanism for answer encoding along with a metric learning loss to aid D in better capturing semantic similarities in answer responses. Overall, our proposed model outperforms state-of-the-art on the VisDial dataset by a significant margin (2.67% on recall@10). The source code can be downloaded from https://github.com/jiasenlu/visDial.pytorch.

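Research Motivation and Goals

Address the "safe", generic responses produced by generative visual dialog models trained with maximum likelihood estimation (MLE).
Train the generator end to end by using the discriminative model as the source of a perceptual loss.
Propose a new encoder (HCIAE) and attention-based answer encoding to improve grounding and co-reference resolution.
Use Gumbel-Softmax with the straight-through estimator to backpropagate through discrete sequences.
Demonstrate performance above existing methods on the VisDial dataset.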
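
To make the Gumbel-Softmax step concrete, here is a minimal sketch of the forward pass for a single token, written in plain Python rather than the authors' PyTorch code. The function name `gumbel_softmax_sample` and the example logits are illustrative assumptions, not taken from the paper or its repository. Plain Python has no autograd, so the straight-through part is only described in a comment.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Return (soft, hard): a relaxed sample and its one-hot argmax.

    Hypothetical helper for illustration only.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u ~ Uniform(0, 1).
        # Clamp u so that neither logarithm sees 0.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)

    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]

    # Discretize: one-hot vector at the argmax of the soft sample.
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]

    # Straight-through estimator (not executed here): an autograd framework
    # would feed `hard` forward while using the gradient of `soft` backward,
    # commonly written as hard - stop_gradient(soft) + soft.
    return soft, hard


rng = random.Random(0)
soft, hard = gumbel_softmax_sample([0.5, 1.5, -1.0], temperature=0.5, rng=rng)
print("soft:", [round(p, 3) for p in soft])
print("hard:", hard)
```

Lower temperatures push `soft` closer to the one-hot `hard` vector at the cost of higher-variance gradients. In a sequence model, one such sampler would sit at every decoding step.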

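Proposed Method

The end-to-end generative model G receives gradients from the discriminative model D through a perceptual loss on sequences sampled from G.
Gumbel-Softmax (GS) with the straight-through estimator makes discrete sequence generation differentiable for training.
The History-Conditioned Image Attentive Encoder (HCIAE) attends over the dialog history and the image to produce a joint embedding.
A metric-learning multi-class N-pair loss for D learns perceptual similarity and accommodates multiple valid answers.
The discriminator perceptual loss L_G encourages G to produce sequences that D scores above the ground truth.
Self-attentive answer encoding and the stronger encoder improve grounding and semantic similarity in the answers.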
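
As a rough illustration of the metric-learning objective for D, the sketch below implements the standard multi-class N-pair form, log(1 + Σ_i exp(e·f(a_i⁻) − e·f(a⁺))), in plain Python. The names `dot` and `n_pair_loss` and the toy two-dimensional embeddings are assumptions for illustration. The paper's exact loss may include additional terms, such as regularization, that are omitted here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def n_pair_loss(context, positive, negatives):
    """Multi-class N-pair loss for one context embedding.

    context   : encoder output e (image + history + question)
    positive  : embedding of the ground-truth answer
    negatives : embeddings of the other candidate answers
    """
    pos_score = dot(context, positive)
    margins = [dot(context, neg) - pos_score for neg in negatives]
    # log(1 + sum(exp(m))) equals log-sum-exp over [0] + margins;
    # subtracting the maximum keeps the exponentials from overflowing.
    terms = [0.0] + margins
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


context = [1.0, 0.0]
good = n_pair_loss(context, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
bad = n_pair_loss(context, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
print(f"positive ranked first: {good:.4f}")
print(f"negative ranked first: {bad:.4f}")
```

The loss shrinks toward zero as the ground-truth answer outscores every negative, and grows when any negative scores higher, which is the behavior a ranking discriminator needs.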

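Experimental Results

Research Questions

RQ1: Does knowledge transfer from a discriminative visual dialog model improve the diversity and informativeness of a generative dialog model?
RQ2: Does the proposed HCIAE encoder improve grounding by resolving co-references between the history and the visual content?
RQ3: Is end-to-end training through Gumbel-Softmax feasible and beneficial for discrete sequence generation in visual dialog?
RQ4: How do the metric-learning loss and self-attention affect discriminator quality and generator performance?
RQ5: How do the training dynamics (non-adversarial knowledge transfer versus adversarial fine-tuning) affect final dialog quality?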

Key Findings

Model	MRR	R@1	R@5	R@10	Mean
HCIAE-G-MLE	0.5386	44.06	63.55	69.24	16.01
HCIAE-G-DIS	0.5467	44.35	65.28	71.55	14.23
HCIAE-D-MLE	0.6140	47.73	77.50	86.35	5.15
HCIAE-D-NP	0.6182	47.98	78.35	87.16	4.92
HCIAE-D-NP-ATT	0.6222	48.48	78.75	87.59	4.81
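
For readers unfamiliar with the columns, the sketch below shows how these retrieval metrics are conventionally computed from the rank that a model assigns to the ground-truth answer among the candidates for each question (VisDial supplies 100 candidates per question). The function name `retrieval_metrics` and the toy ranks are illustrative assumptions and do not come from the paper's evaluation code.

```python
def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """Compute MRR, mean rank, and recall@k (in percent).

    gt_ranks holds the 1-based rank of the ground-truth answer
    for each evaluated question.
    """
    n = len(gt_ranks)
    metrics = {
        "MRR": sum(1.0 / r for r in gt_ranks) / n,  # higher is better
        "Mean": sum(gt_ranks) / n,                  # lower is better
    }
    for k in ks:
        # Share of questions whose ground truth lands in the top k.
        metrics[f"R@{k}"] = 100.0 * sum(r <= k for r in gt_ranks) / n
    return metrics


# Toy example: four questions with ground-truth ranks 1, 3, 12 and 2.
for name, value in retrieval_metrics([1, 3, 12, 2]).items():
    print(f"{name}: {value:.4f}")
```

Note that Mean is the only column in the table where a smaller number indicates a better model.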

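The generator trained under discriminator guidance, G-DIS, outperforms the MLE baseline on VisDial (R@5 rises to 65.28 and R@10 to 71.55).
With the HCIAE encoder, G-DIS reaches 0.5467 MRR, 44.35 / 65.28 / 71.55 on R@1 / R@5 / R@10, and a mean rank of 14.23, ahead of HCIAE-G-MLE (MRR 0.5386).
The discriminative variant with the NP loss and attentive answer encoding gives the strongest results (D-NP-ATT: MRR 0.6222; R@1 48.48; R@5 78.75; R@10 87.59; Mean 4.81).
Knowledge transfer from D to G adds a clear gain on top of the improved encoder alone (HCIAE-G-DIS beats HCIAE-G-MLE by about 1.7 points on R@5).
Continuing to train D adversarially in a GAN setup degrades performance, which suggests that the perceptual structure supplied by a pre-trained D is essential for effective transfer.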
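
The gains quoted above can be checked directly against the table. The short sketch below copies the HCIAE-G-MLE and HCIAE-G-DIS rows and prints the difference for each metric. The dictionary names are illustrative.

```python
# Rows copied from the results table above.
g_mle = {"MRR": 0.5386, "R@1": 44.06, "R@5": 63.55, "R@10": 69.24, "Mean": 16.01}
g_dis = {"MRR": 0.5467, "R@1": 44.35, "R@5": 65.28, "R@10": 71.55, "Mean": 14.23}

# G-DIS minus G-MLE; rounding removes floating-point noise.
deltas = {name: round(g_dis[name] - g_mle[name], 4) for name in g_mle}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

The differences come to +0.0081 MRR, +0.29 on R@1, +1.73 on R@5, +2.31 on R@10, and a mean rank lower by 1.78, so the discriminator-guided generator improves on every reported metric.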

This review was generated by AI and checked by a human editor.