QUICK REVIEW

[Paper Review] Reciprocal Attention Fusion for Visual Question Answering.

Moshiur Farazi, Salman Khan|arXiv (Cornell University)|Jan 1, 2018

Multimodal Machine Learning Applications|Cited by 8

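Summary in Brief

This paper proposes a reciprocal attention fusion mechanism for visual question answering (VQA) that uses bottom-up and top-down attention to jointly model the relationships between object-level and grid-level visual features. By fusing multimodal features hierarchically through tensor decomposition, the model achieves state-of-the-art single-model performance, with 68.2% accuracy on VQAv1 and 67.4% on VQAv2.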

ABSTRACT

Existing attention mechanisms attend to either local image-grid or object-level features for Visual Question Answering (VQA). Motivated by the observation that questions can relate to both object instances and their parts, we propose a novel attention mechanism that jointly considers reciprocal relationships between the two levels of visual detail. The bottom-up attention thus generated is further coalesced with the top-down information to focus only on the scene elements that are most relevant to a given question. Our design hierarchically fuses multi-modal information, i.e., language, object-level, and grid-level features, through an efficient tensor decomposition scheme. The proposed model improves the state-of-the-art single-model performance from 67.9% to 68.2% on VQAv1 and from 65.7% to 67.4% on VQAv2, demonstrating a significant boost.

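Motivation and Objectives

Address a limitation of existing VQA models, which attend to either local image-grid features or object-level features and so miss the fine-grained visual relationships between the two.
Improve VQA performance by modeling the reciprocal attention relationships between object instances and their parts.
Develop a hierarchical fusion mechanism that efficiently integrates language, object-level, and grid-level features.
Achieve state-of-the-art single-model results on standard VQA benchmarks.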

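Proposed Method

Propose a reciprocal attention mechanism that models the bidirectional dependencies between object-level and grid-level visual features.
Use bottom-up attention to generate visual representations at both the object level and the grid level.
Integrate top-down, question-guided attention to refine the focus on relevant scene elements.
Apply an efficient tensor decomposition scheme to hierarchically fuse multimodal features (language, object, grid).
Use the fused features to predict answers that are more relevant to the input question.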
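
To make the flow of the method more concrete, here is a minimal NumPy sketch of question-guided attention over object-level and grid-level features, followed by two stages of hierarchical fusion. It is an illustration under stated assumptions, not the authors' implementation: the review does not say which tensor decomposition the paper uses, so a generic low-rank bilinear factorization stands in for it, and every name, dimension, and random weight below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init(*shape):
    """Random stand-in for learned weights or extracted features (hypothetical)."""
    return rng.normal(scale=0.1, size=shape)

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def low_rank_fuse(a, b, Wa, Wb, Wo):
    """Low-rank bilinear fusion (assumed stand-in for the paper's scheme).

    Rather than a full 3-way weight tensor, both inputs are projected to a
    rank-R space, multiplied elementwise, and projected to the output size.
    """
    return ((a @ Wa) * (b @ Wb)) @ Wo

def attend(q, feats, Wq, Wv, w):
    """Top-down attention: score each visual feature against the question."""
    joint = np.tanh((q @ Wq) * (feats @ Wv))  # (n, R)
    weights = softmax(joint @ w)              # (n,), sums to 1
    return weights @ feats, weights           # attended vector, attention map

# Hypothetical sizes: question dim, visual dim, rank, hidden dim, answer count.
d_q, d_v, R, d_h, n_answers = 16, 32, 8, 24, 10
n_obj, n_grid = 5, 49  # e.g. 5 detected objects and a 7x7 grid

q = init(d_q)                  # question embedding
obj_feats = init(n_obj, d_v)   # object-level (bottom-up) features
grid_feats = init(n_grid, d_v) # grid-level features

# Question-guided attention at each level of visual detail.
obj_vec, obj_w = attend(q, obj_feats, init(d_q, R), init(d_v, R), init(R))
grid_vec, grid_w = attend(q, grid_feats, init(d_q, R), init(d_v, R), init(R))

# Stage 1: fuse the two visual levels. Stage 2: fuse the result with language.
visual = low_rank_fuse(obj_vec, grid_vec, init(d_v, R), init(d_v, R), init(R, d_h))
logits = low_rank_fuse(q, visual, init(d_q, R), init(d_h, R), init(R, n_answers))
answer = int(np.argmax(logits))  # index of the predicted answer
```

The factorized form keeps the parameter count linear in the rank R instead of storing a full three-way interaction tensor, which is the usual reason for choosing decomposition-based fusion.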

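Experimental Results

Research Questions

RQ1: Does modeling the reciprocal relationships between object instances and their parts improve VQA performance?
RQ2: Does joint attention over object-level and grid-level features strengthen visual grounding in VQA?
RQ3: How much does hierarchical fusion of multimodal features via tensor decomposition improve accuracy on VQA benchmarks?
RQ4: Can a single-model architecture surpass previous state-of-the-art methods without ensembling?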

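Key Findings

The proposed model sets a new single-model state of the art of 68.2% on the VQAv1 dataset.
On the VQAv2 benchmark, accuracy rises to 67.4%, up from the previous best of 65.7%.
The reciprocal attention mechanism captures both the global and the fine-grained visual details relevant to the question.
The tensor-decomposition-based fusion scheme integrates multimodal features both efficiently and effectively.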
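
For a sense of scale, the short snippet below computes the absolute gains implied by the accuracies quoted in the abstract. The numbers come from the abstract; the variable names are only for illustration.

```python
# Single-model accuracy (%) quoted in the abstract: (previous best, proposed model)
results = {"VQAv1": (67.9, 68.2), "VQAv2": (65.7, 67.4)}

# Absolute improvement in percentage points, rounded to one decimal place
gains = {name: round(new - old, 1) for name, (old, new) in results.items()}
print(gains)  # {'VQAv1': 0.3, 'VQAv2': 1.7}
```

The improvement is therefore much larger on VQAv2 (1.7 points) than on VQAv1 (0.3 points).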

This review was generated by AI and reviewed by a human editor.