QUICK REVIEW

[Paper Review] Bilinear Attention Networks

Jin-Hwa Kim, Jae-Hyun Jun | arXiv (Cornell University) | May 21, 2018

Multimodal Machine Learning Applications | References: 28 | Cited by: 78

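One-Sentence Summary

BAN learns bilinear attention maps over multi-channel vision-language inputs using low-rank pooling and residual attention, achieving state-of-the-art results on the VQA 2.0 and Flickr30k Entities datasets.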

ABSTRACT

Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost to learn attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, co-attention builds two separate attention distributions for each modality neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions among two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit eight-attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-arts on both datasets.

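Research Motivation and Objectives

Move vision-language fusion beyond co-attention by modeling the interactions between the channels of the two modalities.
Propose a bilinear attention mechanism that attends jointly over two groups of input channels.
Introduce a residual learning scheme that makes efficient use of multiple bilinear attention maps.
Evaluate BAN on VQA 2.0 and Flickr30k Entities to establish state-of-the-art accuracy and grounding ability.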

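Proposed Method

Define a bilinear attention map A between two multi-channel inputs X and Y, and compute the joint representation with low-rank bilinear pooling.
Parameterize A as a softmax over bilinear scores, where the scores are computed with a Hadamard product and low-rank projections (U, V, p).
Extend to multiple glimpses by learning maps A_g that share U and V but use a separate p_g per glimpse.
Apply a variant of multimodal residual networks to integrate multiple BAN attention maps without concatenation, which makes eight-glimpse learning practical.
Use ReLU nonlinearities in both the feature interaction and the attention, with a two-layer MLP classifier for VQA and a BCE loss for the grounding task.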
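The first two steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the shapes, the random weights, and the choice to normalize the softmax over all channel pairs are assumptions made for illustration, and the final output projection and the classifier are omitted.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def bilinear_attention_map(X, Y, U, V, p):
    """Single-glimpse attention map (illustrative sketch).

    X: (n_x, d_x) channels of the first input, Y: (n_y, d_y) channels of the second.
    U: (d_x, k), V: (d_y, k) low-rank projections; p: (k,) scoring vector.
    Returns A of shape (n_x, n_y) whose entries sum to 1 (assumed normalization).
    """
    hx = relu(X @ U)                    # (n_x, k)
    hy = relu(Y @ V)                    # (n_y, k)
    logits = (hx * p) @ hy.T            # score for every pair of channels
    w = np.exp(logits - logits.max())   # numerically stable softmax over all pairs
    return w / w.sum()

def attended_joint_feature(X, Y, U2, V2, A):
    """Low-rank bilinear pooling weighted by A: f_k = sum_ij A_ij * hx_ik * hy_jk."""
    hx = relu(X @ U2)                   # (n_x, k)
    hy = relu(Y @ V2)                   # (n_y, k)
    return np.einsum('ij,ik,jk->k', A, hx, hy)

# Hypothetical sizes: 5 image regions, 3 question tokens, rank 4.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 8)), rng.normal(size=(3, 6))
U, V, p = rng.normal(size=(8, 4)), rng.normal(size=(6, 4)), rng.normal(size=4)
A = bilinear_attention_map(X, Y, U, V, p)
f = attended_joint_feature(X, Y, U, V, A)
print(A.shape, f.shape)
```

For brevity the usage example reuses U and V for both the attention scores and the pooled feature; separate projection matrices can be passed to `attended_joint_feature` instead.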

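Experimental Results

Research Questions

RQ1: Can bilinear attention capture the interactions between vision and language channels more effectively than co-attention or unitary attention?
RQ2: Does residual integration of multiple bilinear attention maps improve accuracy and efficiency?
RQ3: How does BAN perform on VQA 2.0 and Flickr30k Entities in terms of accuracy and grounding speed?
RQ4: How does the number of glimpses affect performance and robustness?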

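Key Findings

BAN with bilinear attention maps outperforms unitary attention and co-attention on the VQA 2.0 validation set.
Increasing the number of glimpses improves the VQA validation score (BAN-1: 65.36, BAN-2: 65.61, BAN-4: 65.81, BAN-8: 66.00, BAN-12: 66.04).
Residual learning of attention gives better results than fusing multiple BAN attention maps by summation or concatenation.
On Flickr30k Entities, BAN reaches 69.69% Recall@1, surpassing previous methods without additional features, and inference is 25.37% faster (0.67 ms/entity).
BAN shows competitive visual grounding ability and remains parameter-efficient with eight glimpses.
The model achieves state-of-the-art results on both the VQA 2.0 and Flickr30k Entities benchmarks.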
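The residual integration of several attention maps referred to in these findings can be sketched as follows. This is a rough illustration under stated assumptions rather than the paper's exact formulation: the per-glimpse projections `Us`, `Vs`, `Ps`, the softmax over all channel pairs, and the final sum over channels are hypothetical choices made to keep the example short.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax_all(logits):
    w = np.exp(logits - logits.max())
    return w / w.sum()

def residual_glimpses(X, Y, U, V, ps, Us, Vs, Ps):
    """Accumulate G attended joint features residually instead of concatenating them.

    X: (n_x, d), Y: (n_y, d_y). U: (d, k) and V: (d_y, k) are shared by all glimpses
    for the attention scores; ps[g]: (k,) is the per-glimpse scoring vector.
    Us[g]: (d, k), Vs[g]: (d_y, k), Ps[g]: (k, d) are assumed per-glimpse
    projections for the joint feature. Returns a vector of shape (d,).
    """
    hx_att, hy_att = relu(X @ U), relu(Y @ V)
    F = X.astype(float)                                  # running representation, starts at X
    for p, Ug, Vg, Pg in zip(ps, Us, Vs, Ps):
        A = softmax_all((hx_att * p) @ hy_att.T)         # glimpse-specific map A_g
        joint = np.einsum('ij,ik,jk->k', A, relu(F @ Ug), relu(Y @ Vg)) @ Pg  # (d,)
        F = F + joint                                    # residual update, broadcast to every channel
    return F.sum(axis=0)                                 # pool channels for a classifier

# Hypothetical sizes: 5 regions, 3 tokens, feature size 8, rank 4, 2 glimpses.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 8)), rng.normal(size=(3, 6))
U, V = rng.normal(size=(8, 4)), rng.normal(size=(6, 4))
ps = [rng.normal(size=4) for _ in range(2)]
Us = [rng.normal(size=(8, 4)) for _ in range(2)]
Vs = [rng.normal(size=(6, 4)) for _ in range(2)]
Ps = [rng.normal(size=(4, 8)) for _ in range(2)]
print(residual_glimpses(X, Y, U, V, ps, Us, Vs, Ps).shape)
```

Because each glimpse adds to a fixed-size representation, the output dimension does not grow with the number of glimpses, unlike concatenation.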


This review was generated by AI and checked by a human editor.