QUICK REVIEW

[Paper Review] Exploring Human-like Attention Supervision in Visual Question Answering

Tingting Qiao, Jianfeng Dong|arXiv (Cornell University)|Sep 19, 2017

Multimodal Machine Learning Applications | Cited by 45

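One-Sentence Summary

This paper proposes the Human Attention Network (HAN), which is trained on the VQA-HAT dataset to generate human-like attention maps for Visual Question Answering (VQA), and uses it to build the Human-Like ATtention (HLAT) dataset on top of VQA v2.0. Using these human-like attention maps as a supervision signal improves the model's attention accuracy and yields a 0.15% absolute gain in overall VQA accuracy over the baseline trained without attention supervision.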

ABSTRACT

Attention mechanisms have been widely applied in the Visual Question Answering (VQA) task, as they help to focus on the area-of-interest of both visual and textual information. To answer the questions correctly, the model needs to selectively target different areas of an image, which suggests that an attention-based model may benefit from an explicit attention supervision. In this work, we aim to address the problem of adding attention supervision to VQA models. Since there is a lack of human attention data, we first propose a Human Attention Network (HAN) to generate human-like attention maps, training on a recently released dataset called Human ATtention Dataset (VQA-HAT). Then, we apply the pre-trained HAN on the VQA v2.0 dataset to automatically produce the human-like attention maps for all image-question pairs. The generated human-like attention map dataset for the VQA v2.0 dataset is named as Human-Like ATtention (HLAT) dataset. Finally, we apply human-like attention supervision to an attention-based VQA model. The experiments show that adding human-like supervision yields a more accurate attention together with a better performance, showing a promising future for human-like attention supervision in VQA.

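Research Motivation and Objectives

Address the lack of human-annotated attention maps in large-scale VQA datasets.
Investigate whether human attention patterns can improve the performance of attention-based VQA models.
Develop a method for generating synthetic human-like attention maps for large-scale VQA tasks.
Evaluate how effective explicit human-like attention supervision is at improving VQA model performance.
Create and release the HLAT dataset as a benchmark for attention supervision in VQA.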

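Proposed Method

Train the Human Attention Network (HAN) on the VQA-HAT dataset to predict human-like attention maps from image-question pairs.
Use a gated recurrent unit (GRU) to encode the multiple attention maps produced by a pre-trained VQA model into a more refined human-like attention map.
Apply the pre-trained HAN to the entire VQA v2.0 dataset to generate a large-scale dataset of human-like attention maps, named HLAT.
Use the HLAT maps as ground-truth attention labels when training an attention-based VQA model.
Train VQA models with and without human-like attention supervision for comparison.
Evaluate model performance with the standard VQA accuracy metric, which scores answers by consensus among human annotators.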
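The supervision step above can be sketched as a combined training objective: the usual answer-classification loss plus a term that penalizes the gap between the model's attention map and the corresponding HLAT map. This summary does not state the exact loss used in the paper, so the mean-squared-error term, the weight `lam`, and all function names below are illustrative assumptions rather than the authors' implementation.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_cross_entropy(answer_logits, target_index):
    """Standard classification loss over candidate answers."""
    probs = softmax(answer_logits)
    return -math.log(probs[target_index])


def attention_mse(model_attention, hlat_attention):
    """Assumed attention loss: mean squared error between the model's
    attention weights and the human-like (HLAT) attention weights,
    both given as flat lists over the same image regions."""
    if len(model_attention) != len(hlat_attention):
        raise ValueError("attention maps must cover the same regions")
    squared = [(a - b) ** 2 for a, b in zip(model_attention, hlat_attention)]
    return sum(squared) / len(squared)


def supervised_loss(answer_logits, target_index,
                    model_attention, hlat_attention, lam=1.0):
    """Answer loss plus a weighted attention-supervision term.
    `lam` is a hypothetical weighting hyperparameter; lam=0 recovers
    training without attention supervision."""
    answer_term = answer_cross_entropy(answer_logits, target_index)
    attention_term = attention_mse(model_attention, hlat_attention)
    return answer_term + lam * attention_term
```

Setting `lam=0` gives the comparison model trained without attention supervision, so the same code path could serve both sides of the comparison described above.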

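Experimental Results

Research Questions

RQ1: Can human attention patterns improve the performance of attention-based VQA models?
RQ2: Do the regions highlighted by human attention correspond to visual features that are more accurate and more relevant for answering the question?
RQ3: Can a deep learning model learn to generate human-like attention maps from a limited amount of human-annotated data?
RQ4: Does explicit supervision with synthetic human-like attention maps lead to better attention localization and higher VQA accuracy?
RQ5: How does attention map quality affect a model's ability to answer complex questions, such as counting or reasoning tasks?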

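Key Findings

With two attention regions, the VQA model trained with human-like attention supervision achieves a 0.15% absolute gain in overall accuracy over the baseline without attention supervision.
With one attention region, the supervised model gains 0.11% in accuracy over the model without attention supervision.
Visualizations show that attention maps from the supervised model are more accurate and more focused on the relevant image regions.
The HAN variant that encodes attention maps with a GRU outperforms the variant without the GRU, indicating that sequence modeling helps refine attention.
The HLAT dataset generated by HAN provides a large-scale resource of synthetic human-like attention maps for VQA and has been publicly released for research use.
The supervised model reaches notably higher accuracy on counting questions, suggesting more precise attention localization on complex reasoning tasks.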
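The accuracy figures above refer to the standard consensus-based VQA metric, in which a predicted answer earns full credit when at least three human annotators gave the same answer. The following is a minimal sketch of that scoring rule; the function names are illustrative, and the official evaluation applies additional answer normalization that is omitted here.

```python
def vqa_accuracy(predicted, human_answers):
    """Consensus score for one question: min(#matching annotators / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


def mean_accuracy(predictions, all_human_answers):
    """Average consensus score over a set of questions, as a percentage."""
    scores = [vqa_accuracy(p, h)
              for p, h in zip(predictions, all_human_answers)]
    return 100.0 * sum(scores) / len(scores)
```

Under this metric, an "absolute" gain such as 0.15% is the difference between two percentage scores, for example between the outputs of `mean_accuracy` for the supervised model and for the baseline on the same question set.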

This review was generated by AI and checked by human editors.