QUICK REVIEW

[Paper Review] A Focused Dynamic Attention Model for Visual Question Answering

Ilija Ilievski, Shuicheng Yan|arXiv (Cornell University)|Apr 6, 2016

Multimodal Machine Learning Applications | References: 24 | Citations: 130

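One-Line Summary

FDA applies question-guided, focused dynamic attention to object regions, combines local and global visual features with an LSTM under the guidance of the question, and achieves state-of-the-art results on the open-ended and multiple-choice VQA benchmarks.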

ABSTRACT

Visual Question and Answering (VQA) problems are attracting increasing interest from multiple research disciplines. Solving VQA problems requires techniques from both computer vision for understanding the visual contents of a presented image or video, as well as the ones from natural language processing for understanding semantics of the question and generating the answers. Regarding visual content modeling, most of existing VQA methods adopt the strategy of extracting global features from the image or video, which inevitably fails in capturing fine-grained information such as spatial configuration of multiple objects. Extracting features from auto-generated regions -- as some region-based image recognition methods do -- cannot essentially address this problem and may introduce some overwhelming irrelevant features with the question. In this work, we propose a novel Focused Dynamic Attention (FDA) model to provide better aligned image content representation with proposed questions. Being aware of the key words in the question, FDA employs off-the-shelf object detector to identify important regions and fuse the information from the regions and global features via an LSTM unit. Such question-driven representations are then combined with question representation and fed into a reasoning unit for generating the answers. Extensive evaluation on a large-scale benchmark dataset, VQA, clearly demonstrate the superior performance of FDA over well-established baselines.

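Motivation and Objectives

Motivate improving visual content modeling for VQA beyond global image features.
Develop a question-driven attention mechanism that focuses on relevant image regions.
Fuse the focused region features with the global image context and the question representation.
Demonstrate performance gains over baselines and existing attention models on a large-scale VQA benchmark.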

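Proposed Method

Extract global and region-based CNN features from the image.
Use an object detector to identify candidate regions relevant to the question.
Represent the image regions and the global context as the input to an LSTM that encodes visual information in the order of the question words.
Encode the question with an LSTM to obtain a question representation.
Apply the focused dynamic attention mechanism: arrange region features in question-word order and combine them with the global feature.
Fuse the question and visual representations through tanh and ReLU activations, then apply element-wise multiplication and a feed-forward network with softmax to predict over the 1000 most common answers.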
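
The ordering and fusion steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. It assumes the object detector returns (label, feature) pairs and that regions are matched to question words by exact label match. It also assumes one plausible placement of the activations: tanh on each representation before the element-wise product, then a ReLU hidden layer. The LSTM encoders are omitted, so `question_vec` and `visual_vec` stand in for their final states. All function and parameter names are hypothetical.

```python
import math


def order_regions_by_question(question_words, detections, global_feature):
    """Build the visual input sequence for the (omitted) visual LSTM.

    detections: (label, feature_vector) pairs from a hypothetical detector.
    Region features are emitted in question-word order; the global image
    feature is appended last (the position is an assumption of this sketch).
    """
    sequence = []
    for word in question_words:
        for label, feature in detections:
            if label == word:  # assumed matching rule: exact label match
                sequence.append(feature)
    sequence.append(global_feature)
    return sequence


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, vec):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def fuse_and_predict(question_vec, visual_vec, w_hidden, b_hidden, w_out, b_out):
    """tanh both inputs, multiply element-wise, ReLU hidden layer, softmax."""
    q = [math.tanh(x) for x in question_vec]
    v = [math.tanh(x) for x in visual_vec]
    joint = [a * b for a, b in zip(q, v)]           # element-wise product
    hidden = [max(0.0, h) for h in linear(w_hidden, b_hidden, joint)]  # ReLU
    return softmax(linear(w_out, b_out, hidden))    # answer distribution
```

In the model described above, the output layer would have 1000 rows, one per candidate answer; the sketch works for any number of rows.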

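Experimental Results

Research Questions

RQ1: Does question-driven focus on object-centric image regions improve VQA accuracy compared with global or unfocused attention methods?
RQ2: How does incorporating both localized region features and global context affect the open-ended and multiple-choice VQA tasks?
RQ3: Can LSTM-based fusion of the question and focused visual features achieve state-of-the-art results on the VQA benchmark?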

Key Findings

FDA achieves state-of-the-art performance on the VQA dataset for both the open-ended and multiple-choice tasks.
Open-ended test-dev: FDA 59.24 (All), 81.14 (Y/N), 45.77 (Other), 36.16 (Num); test-std: 59.54 (All).
Multiple-choice test-dev: FDA 64.01 (All), 81.50 (Y/N), 54.72 (Other), 39.00 (Num); test-std: 64.18 (All).
FDA outperforms the SAN baseline by about 0.6% on open-ended and by about 1.1% on multiple-choice.
Qualitative results show that focusing the model on relevant regions improves accuracy on color, counting, and object identification questions.

This review was generated by AI and checked by a human editor.