QUICK REVIEW

[Paper Review] VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

Chuang Gan, Yandong Li|arXiv (Cornell University)|Aug 15, 2017

Multimodal Machine Learning Applications | References: 42 | Cited by: 26

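One-Sentence Summary

This paper presents VQS (visual questions and segmentation answers), a dataset that links COCO instance segmentations to the questions and answers in the VQA dataset, which enables supervised attention for VQA and introduces a new task, question-focused semantic segmentation (QFSS). Using the segmentation-QA links as explicit supervision, the approach achieves state-of-the-art results on the VQA real multiple-choice benchmark and improves attention learning, and the paper shows that QFSS is feasible with mask aggregation and DeconvNet-based models.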

ABSTRACT

Rich and dense human labeled datasets are among the main enabling factors for the recent advance on vision-language understanding. Many seemingly distant annotations (e.g., semantic segmentation and visual question answering (VQA)) are inherently connected in that they reveal different levels and perspectives of human understandings about the same visual scenes --- and even the same set of images (e.g., of COCO). The popularity of COCO correlates those annotations and tasks. Explicitly linking them up may significantly benefit both individual tasks and the unified vision and language modeling. We present the preliminary work of linking the instance segmentations provided by COCO to the questions and answers (QAs) in the VQA dataset, and name the collected links visual questions and segmentation answers (VQS). They transfer human supervision between the previously separate tasks, offer more effective leverage to existing problems, and also open the door for new research problems and models. We study two applications of the VQS data in this paper: supervised attention for VQA and a novel question-focused semantic segmentation task. For the former, we obtain state-of-the-art results on the VQA real multiple-choice task by simply augmenting the multilayer perceptrons with some attention features that are learned using the segmentation-QA links as explicit supervision. To put the latter in perspective, we study two plausible methods and compare them to an oracle method assuming that the instance segmentations are given at the test stage.

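Research Motivation and Goals

Bridge the gap between semantic segmentation and visual question answering (VQA) by explicitly linking COCO instance segmentations to QA pairs.
Enable supervised attention in VQA by using the segmentation-QA links as the supervision signal for the attention mechanism.
Propose and evaluate a new task, question-focused semantic segmentation (QFSS), in which a model produces the segmentation that visually answers a given question.
Examine how different question representations (word embeddings versus bag-of-words) affect QFSS performance.
Establish a new benchmark for vision-language understanding by transferring human annotation supervision between traditionally separate tasks.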

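Proposed Method

Build the VQS dataset by annotating COCO images with links between instance segmentations and the corresponding VQA questions and answers.
Train a VQA model with attention features on the VQS links, supervising the attention with segmentation masks so that it localizes the relevant regions more accurately.
Build a QFSS framework from mask aggregation and DeconvNet that generates segmentation masks conditioned on the question.
Train the DeconvNet-based model with an L2 loss between predicted and ground-truth segmentation masks, adding a question-conditioned attention mechanism.
Compare two question representations, word embeddings and bag-of-words features, and assess their effect on QFSS performance.
Use an oracle method, which assumes the instance segmentations are given at test time, to set a performance upper bound for QFSS.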
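The attention supervision described above can be illustrated with a small NumPy sketch. It is only an assumption about how a segmentation mask might be turned into an attention target: the grid size, the average pooling, the cross-entropy loss, and the function names `mask_to_attention_target` and `attention_supervision_loss` are all hypothetical and are not taken from the paper.

```python
import numpy as np

def mask_to_attention_target(mask, grid=(7, 7)):
    """Average-pool a binary mask of shape (H, W) onto a coarse grid and
    normalize it into a distribution over grid cells.
    Assumes H and W are at least as large as the grid dimensions."""
    gh, gw = grid
    H, W = mask.shape
    # Crop so that both sides divide evenly into grid cells.
    mask = mask[: H - H % gh, : W - W % gw].astype(np.float64)
    h, w = mask.shape
    pooled = mask.reshape(gh, h // gh, gw, w // gw).mean(axis=(1, 3))
    total = pooled.sum()
    if total == 0:
        # No linked segmentation: fall back to a uniform target (assumption).
        return np.full(grid, 1.0 / (gh * gw))
    return pooled / total

def attention_supervision_loss(logits, target):
    """Cross-entropy between softmax(logits) and the target distribution.
    `logits` and `target` share the same grid shape."""
    z = logits - logits.max()            # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum())  # log-softmax over all cells
    return float(-(target * log_p).sum())

# Toy example: the linked segment covers only the top-left grid cell.
mask = np.zeros((14, 14), dtype=bool)
mask[:2, :2] = True
target = mask_to_attention_target(mask)

uniform_logits = np.zeros((7, 7))
peaked_logits = np.zeros((7, 7))
peaked_logits[0, 0] = 5.0

loss_uniform = attention_supervision_loss(uniform_logits, target)
loss_peaked = attention_supervision_loss(peaked_logits, target)
```

In this toy setup, attention that concentrates on the annotated region receives a lower loss than uniform attention, which is the behavior a segmentation-supervised attention term is meant to encourage.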

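Experimental Results

Research Questions

RQ1: Does linking instance segmentations to VQA questions improve attention supervision in visual question answering?
RQ2: Can the VQS dataset be used to reach state-of-the-art performance on the VQA real multiple-choice benchmark?
RQ3: Can the question-focused semantic segmentation (QFSS) task be built and evaluated effectively from the segmentation-QA links?
RQ4: How do different question representations (word embeddings versus bag-of-words) affect QFSS performance?
RQ5: How large is the performance gap between the proposed QFSS methods and the oracle method, which assumes the instance segmentations are given at test time?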

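Key Findings

The VQS-based supervised attention method achieves state-of-the-art performance on the VQA real multiple-choice benchmark by augmenting multilayer perceptrons with attention features learned under segmentation supervision.
The mask aggregation method for QFSS outperforms the baseline DeconvNet but still falls well short of the oracle upper bound, which leaves room for improvement.
On average, more than one segmentation is selected per question, which indicates that a question often requires several visual entities to be answered fully.
Bag-of-words and word-embedding representations yield distinguishable results, which indicates that QFSS performance is sensitive to how the question is encoded.
Qualitative results show that the model correctly identifies the multiple relevant segments needed for questions such as "How many?".
The VQS dataset transfers human annotation supervision effectively between semantic segmentation and VQA, which demonstrates the value of linking diverse annotations on the same set of images.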
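The gap between mask aggregation and the oracle can be made concrete with a toy example. The sketch below is only one plausible reading: it assumes that aggregation means taking the union of candidate masks whose question-relevance score exceeds a threshold, and that intersection-over-union (IoU) is the metric, neither of which this summary states. The names `aggregate_masks` and `iou`, the threshold, and the scores are hypothetical.

```python
import numpy as np

def aggregate_masks(masks, scores, threshold=0.5):
    """Union of the candidate masks whose relevance score exceeds `threshold`.
    `masks` has shape (N, H, W); `scores` has shape (N,)."""
    masks = np.asarray(masks, dtype=bool)
    keep = np.asarray(scores) > threshold
    if not keep.any():
        return np.zeros(masks.shape[1:], dtype=bool)
    return masks[keep].any(axis=0)

def iou(pred, gt):
    """Intersection-over-union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match (assumption)
    return float(np.logical_and(pred, gt).sum() / union)

# Three candidate instance masks on a 4x4 image, one image row each.
masks = np.zeros((3, 4, 4), dtype=bool)
masks[0, 0, :] = True
masks[1, 1, :] = True
masks[2, 2, :] = True

# The question is answered by instances 0 and 1 together.
gt = masks[0] | masks[1]

# Hypothetical predicted relevance: misses instance 1, wrongly keeps instance 2.
predicted = aggregate_masks(masks, scores=[0.9, 0.2, 0.7])
# Oracle-style selection: the correct instances are known at test time.
oracle = aggregate_masks(masks, scores=[1.0, 1.0, 0.0])

predicted_iou = iou(predicted, gt)
oracle_iou = iou(oracle, gt)
```

Under these assumptions the oracle reaches a perfect score because the only remaining error source is which instances are selected, while the predicted selection loses overlap whenever it picks the wrong entities for the question.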

This review was generated by AI and checked by a human editor.