QUICK REVIEW

[Paper Review] Adaptively Aligned Image Captioning via Adaptive Attention Time

Lun Huang, Wenmin Wang|arXiv (Cornell University)|Sep 19, 2019

Multimodal Machine Learning Applications | Cited by 39

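One-sentence summary

This paper proposes Adaptive Attention Time (AAT), a differentiable mechanism that decides adaptively, at each decoding step of image captioning, how many attention steps to take; it outperforms both fixed single-step attention and recurrent attention models.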

ABSTRACT

Recent neural models for image captioning usually employ an encoder-decoder framework with an attention mechanism. However, the attention mechanism in such a framework aligns one single (attended) image feature vector to one caption word, assuming one-to-one mapping from source image regions and target caption words, which is never possible. In this paper, we propose a novel attention model, namely Adaptive Attention Time (AAT), to align the source and the target adaptively for image captioning. AAT allows the framework to learn how many attention steps to take to output a caption word at each decoding step. With AAT, an image region can be mapped to an arbitrary number of caption words while a caption word can also attend to an arbitrary number of image regions. AAT is deterministic and differentiable, and doesn't introduce any noise to the parameter gradients. In this paper, we empirically show that AAT improves over state-of-the-art methods on the task of image captioning. Code is available at https://github.com/husthuaan/AAT.

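Motivation and objectives

Improve image captioning by removing the one-to-one mapping between image regions and caption words that standard attention models assume.
Make the number of attention steps per word adaptive, so that image regions and caption words can be aligned flexibly.
Keep the model differentiable and stable while allowing adaptive computation during decoding.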

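Proposed method

Proposes Adaptive Attention Time (AAT), which learns how many attention steps to perform at each decoding step.
Embeds AAT in a two-layer LSTM encoder-decoder equipped with an attention module that can perform multiple attention steps per word.
Uses a confidence network, inspired by Adaptive Computation Time (ACT), to decide when to stop attending and emit a word.
Introduces multi-head attention to better capture interactions among image regions.
Adds a time-cost penalty to the training objective to balance accuracy against computation.
Draws connections showing that the base, recurrent, and adaptive attention models are special cases of AAT.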
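
To make the halting idea concrete, here is a minimal NumPy sketch of one decoding step that takes a variable number of attention steps. It is an illustration under stated assumptions, not the authors' implementation: the function names, the one-layer confidence network (`w_conf`, `b_conf`), the `tanh` update standing in for the LSTM, the dot-product scoring, and the "expected number of steps" penalty are all hypothetical simplifications. Only the overall pattern (attend, score confidence, stop once the leftover probability is negligible, penalize attention time with a weight lambda) follows the summary above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_attention(query, regions, w_conf, b_conf, max_steps=4, eps=1e-4):
    """Run a variable number of attention steps for one decoding step.

    query:   (d,) decoder state for the current word
    regions: (R, d) image region features
    w_conf, b_conf: parameters of a hypothetical one-layer confidence network
    Returns the halting-weighted state, the number of steps taken,
    and the normalized halting weights.
    """
    state = query.copy()
    remaining = 1.0                      # probability of not having halted yet
    states, betas = [], []
    for n in range(1, max_steps + 1):
        alpha = softmax(regions @ state / np.sqrt(state.size))
        context = alpha @ regions        # attended image feature
        state = np.tanh(state + context) # stand-in for the LSTM update (assumption)
        p = sigmoid(w_conf @ state + b_conf)  # confidence that we can stop now
        if n == max_steps:
            p = 1.0                      # forced stop once the step budget is used up
        states.append(state)
        betas.append(remaining * p)      # weight given to this step's state
        remaining *= 1.0 - p
        if remaining < eps:              # leftover probability is negligible
            break
    betas = np.array(betas) / np.sum(betas)
    output = betas @ np.stack(states)    # halting-weighted mix of the states
    return output, len(betas), betas

def time_cost(betas, lam=1e-4):
    # Assumed penalty: lambda times the expected number of attention steps.
    steps = np.arange(1, len(betas) + 1)
    return lam * float(betas @ steps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, r = rng.normal(size=8), rng.normal(size=(5, 8))
    out, n_steps, betas = adaptive_attention(q, r, rng.normal(size=8), 0.0)
    print(n_steps, betas, time_cost(betas))
```

Because every step's contribution is a smooth function of the confidence values, the weighted output stays differentiable even though the number of steps varies; setting `max_steps=1` reduces the sketch to ordinary single-step attention.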

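Experimental results

Research questions

RQ1: Does taking an adaptive number of attention steps at each decoding step yield better captions than single-step or fixed-step attention models?
RQ2: How does AAT trade off caption quality against computational cost, measured in attention time?
RQ3: Within this framework, what is the effect of the number of attention heads and of additive versus dot-product attention?
RQ4: Does the adaptive attention mechanism generalize to encoder-decoder tasks beyond image captioning?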
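
For readers less familiar with the terms in RQ3, the following NumPy sketch shows the standard forms of the two scoring functions and a simple multi-head wrapper. It is a generic illustration rather than the paper's configuration: the parameter names (`w_q`, `w_k`, `v`), the hidden size of 4, sharing one set of additive parameters across heads, and splitting heads by reshaping the feature vector are all assumptions made for brevity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_product_scores(query, keys):
    # Scaled dot-product: one inner product per image region.
    return keys @ query / np.sqrt(query.size)

def additive_scores(query, keys, w_q, w_k, v):
    # Additive form: v^T tanh(W_q q + W_k k_i) for each region i.
    return np.tanh(keys @ w_k.T + w_q @ query) @ v

def multi_head_attention(query, keys, num_heads, score_fn):
    """Split features into heads, attend per head, concatenate the results."""
    d = query.size
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    d_h = d // num_heads
    q_heads = query.reshape(num_heads, d_h)                # (H, d_h)
    k_heads = keys.reshape(keys.shape[0], num_heads, d_h)  # (R, H, d_h)
    outputs = []
    for h in range(num_heads):
        k_h = k_heads[:, h, :]
        alpha = softmax(score_fn(q_heads[h], k_h))  # weights over regions
        outputs.append(alpha @ k_h)                 # per-head context vector
    return np.concatenate(outputs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, heads, regions = 16, 8, 5
    d_h = d // heads
    q, k = rng.normal(size=d), rng.normal(size=(regions, d))
    w_q, w_k = rng.normal(size=(4, d_h)), rng.normal(size=(4, d_h))
    v = rng.normal(size=4)
    additive = lambda qh, kh: additive_scores(qh, kh, w_q, w_k, v)
    print(multi_head_attention(q, k, heads, dot_product_scores).shape)
    print(multi_head_attention(q, k, heads, additive).shape)
```

The two scoring functions are interchangeable behind `score_fn`, which is the kind of swap an ablation over attention type would make.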

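Key findings

Model	Cross-entropy BLEU-4	Cross-entropy METEOR	Cross-entropy ROUGE	Cross-entropy CIDEr-D	Cross-entropy SPICE	Self-critical BLEU-4	Self-critical METEOR	Self-critical ROUGE	Self-critical CIDEr-D	Self-critical SPICE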
LSTM	29.6	25.2	52.6	94.0	-	31.9	25.5	54.3	106.3	-
ADP-ATT	33.2	26.6	-	108.5	-	-	-	-	-	-
SCST	30.0	25.9	53.4	99.4	-	34.2	26.7	55.7	114.0	-
Up-Down	36.2	27.0	56.4	113.5	20.3	36.3	27.7	56.9	120.1	21.4
RFNet	35.8	27.4	56.8	112.5	20.5	36.5	27.7	57.3	121.9	21.2
GCN-LSTM	36.8	27.9	57.0	116.3	20.9	38.2	28.5	58.3	127.6	22.0
SGAE	-	-	-	-	-	38.4	28.4	58.6	127.8	22.1
AAT (Ours)	37.0	28.1	57.3	117.2	21.2	38.7	28.6	58.5	128.6	22.2

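On MS COCO (Karpathy split), AAT outperforms the base and recurrent attention models on METEOR, CIDEr-D, and SPICE, using an average of 2.55 attention steps per decoding step.
With lambda = 1e-4, AAT performs strongly while keeping the average number of attention steps relatively low (2.54–2.84 in the ablations).
Multi-head additive attention (8 heads) gives the best trade-off, reaching CIDEr-D 128.6 and SPICE 22.2 under self-critical training.
Compared with Up-Down (a previous state of the art), AAT achieves notable gains on BLEU-4, METEOR, ROUGE-L, CIDEr-D, and SPICE in both training stages.
In the reported results, a single AAT model reaches 128.6 CIDEr-D on the MS COCO test set, state-of-the-art performance at the time.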
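
The size of the gains over Up-Down can be read directly off the Up-Down and AAT rows of the table. The short script below just does that subtraction for both training stages; the numbers are copied from the table, and the dictionary layout and the `gains` helper are illustrative choices only.

```python
# Scores copied from the Up-Down and AAT rows of the table above.
METRICS = ["BLEU-4", "METEOR", "ROUGE", "CIDEr-D", "SPICE"]

up_down = {
    "cross-entropy": [36.2, 27.0, 56.4, 113.5, 20.3],
    "self-critical": [36.3, 27.7, 56.9, 120.1, 21.4],
}
aat = {
    "cross-entropy": [37.0, 28.1, 57.3, 117.2, 21.2],
    "self-critical": [38.7, 28.6, 58.5, 128.6, 22.2],
}

def gains(new, old):
    """Per-metric difference (new minus old), rounded to one decimal."""
    return {m: round(n - o, 1) for m, n, o in zip(METRICS, new, old)}

deltas = {stage: gains(aat[stage], up_down[stage]) for stage in aat}

for stage, delta in deltas.items():
    print(stage, ", ".join(f"{m} {d:+.1f}" for m, d in delta.items()))
# cross-entropy BLEU-4 +0.8, METEOR +1.1, ROUGE +0.9, CIDEr-D +3.7, SPICE +0.9
# self-critical BLEU-4 +2.4, METEOR +0.9, ROUGE +1.6, CIDEr-D +8.5, SPICE +0.8
```

The largest absolute gain is on CIDEr-D under self-critical training (+8.5 points).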

This review was generated by AI and checked by a human editor.