QUICK REVIEW

[Paper Review] Deep Modular Co-Attention Networks for Visual Question Answering

Yu Zhou, Jun Yu|arXiv (Cornell University)|Jun 25, 2019

Multimodal Machine Learning Applications|References: 33|Cited by: 99

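One-Sentence Summary

MCAN introduces deep modular co-attention layers that combine self-attention and guided attention over questions and images; cascaded in either a stacking or an encoder-decoder design, these layers achieve state-of-the-art results on VQA-v2.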

ABSTRACT

Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the guided-attention of images jointly using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state-of-the-art. Our best single model delivers 70.63% overall accuracy on the test-dev set. Code is available at https://github.com/MILVLG/mcan-vqa.

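Motivation and Goals

Improve fine-grained multimodal understanding in VQA by learning dense interactions between image regions and question words.
Design a deep architecture that progressively refines cross-modal representations by stacking modular co-attention layers.
Investigate the benefit of self-attention in both modalities and the role of deep co-attention in visual reasoning and counting.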

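Proposed Method

Introduce the Modular Co-Attention (MCA) layer, which composes self-attention (SA) and guided-attention (GA) units.
Model two basic attention units: SA for intra-modal interactions (word-to-word or region-to-region) and GA for inter-modal interactions (question words to image regions).
Cascade multiple MCA layers into a deep MCAN, with stacking and encoder-decoder variants.
Represent images with bottom-up region features from Faster R-CNN; represent questions with GloVe word embeddings passed through an LSTM to obtain the question feature matrix.
Use multi-head scaled dot-product attention with residual connections and layer normalization in the SA and GA units.
Use L MCA layers (L ∈ {1,2,4,6,8}), composed by stacking or by an encoder-decoder strategy, for deep co-attention learning; the outputs pass through a two-layer attentional reduction and linear multimodal fusion, and a 3,129-way classifier trained with binary cross-entropy (BCE) predicts the answer.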
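
The two attention units described above can be illustrated with a minimal sketch of scaled dot-product attention. This is not the authors' implementation (see the linked repository for that): it is single-head, has no learned projections, and omits the residual connections and layer normalization. The function names are hypothetical, and the sketch assumes the usual cross-attention convention in which the guided input supplies the queries while the guiding input supplies the keys and values.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Inputs are lists of equal-length vectors; learned projections are omitted.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def self_attention(x):
    """SA-style unit (hypothetical name): queries, keys, values all come from x."""
    return attention(x, x, x)

def guided_attention(x, y):
    """GA-style unit (hypothetical name): x supplies queries, y supplies keys and values."""
    return attention(x, y, y)
```

Calling guided_attention(image_regions, question_words) returns one vector per image region, each a weighted mix of the question-word vectors, while self_attention keeps the mixing inside a single modality.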

Figure 1: Accuracies vs. co-attention depth on VQA-v2 val split. We list most of the state-of-the-art approaches with (deep) co-attention models. Except for DCN [24] which uses the convolutional visual features and thus leads to inferior performance, all the compared methods (i.e., MCAN, BAN [

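Experimental Results

Research Questions

RQ1: Do deeply cascaded MCA layers outperform shallow co-attention models on VQA?
RQ2: How does self-attention in the image and question modalities affect VQA accuracy, including object counting?
RQ3: How do the stacking and encoder-decoder deep co-attention models compare in performance and optimization stability?
RQ4: How effective are the proposed MCAN fusion and classifier designs on the VQA-v2 benchmark?
RQ5: How much do different question representations (GloVe, random, LSTM) affect the results?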

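Key Findings

MCAN with deep MCA layers significantly outperforms prior co-attention models on VQA-v2.
Self-attention over both question words and image regions improves performance; the SA(Y)-SGA(X,Y) variant yields strong results.
As depth increases, encoder-decoder deep co-attention generally outperforms stacking, because it makes better use of hierarchical representations.
The best single model (MCAN ed-6) reaches 70.63% overall accuracy on the VQA-v2 test-dev split and 70.90% on test-std, with competitive counting ability.
Compared with BAN and MFH, MCAN is parameter-efficient (e.g., MCAN ed-2 has about 27M parameters) while delivering higher accuracy.
Visualizations show that the learned attention aligns with key words and relevant image regions, and that image self-attention improves counting by focusing on object regions.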
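
The difference between the stacking and encoder-decoder strategies compared above is mostly a matter of wiring, which the following sketch tries to make concrete. It is an assumption-laden illustration, not the authors' code: the function names are hypothetical, each layer is reduced to a pair of plain callables (a question update and a question-guided image update), and the exact order of operations inside a layer is assumed rather than taken from the source.

```python
def run_stacking(x, y, layers):
    """Stacking (assumed wiring): image layer l is guided by question layer l."""
    for update_question, update_image in layers:
        y = update_question(y)   # refine question features one step
        x = update_image(x, y)   # guide the image with the same-depth question features
    return x, y

def run_encoder_decoder(x, y, layers):
    """Encoder-decoder (assumed wiring): every image layer sees the final question features."""
    for update_question, _ in layers:
        y = update_question(y)   # encoder: run all question layers first
    for _, update_image in layers:
        x = update_image(x, y)   # decoder: each image layer is guided by the last question output
    return x, y

# Toy stand-ins that only record which question depth guided each image layer.
toy_layers = [(lambda y: y + 1, lambda x, y: x + [y])] * 3
stacked, _ = run_stacking([], 0, toy_layers)          # [1, 2, 3]
enc_dec, _ = run_encoder_decoder([], 0, toy_layers)   # [3, 3, 3]
```

With three toy layers, stacking guides the image with question features of depth 1, 2, and 3 in turn, whereas the encoder-decoder wiring guides every image layer with the depth-3 question features.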


This review was generated by AI and reviewed by human editors.