QUICK REVIEW

[Paper Review] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson, Xiaodong He|arXiv (Cornell University)|Jul 25, 2017

Multimodal Machine Learning Applications|References: 55|Citations: 94

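One-line summary

This paper proposes an approach that combines bottom-up attention (region proposals from Faster R-CNN) with a top-down attention mechanism, and applies attention over salient image regions to image captioning and Visual Question Answering. The method achieves state-of-the-art results on MSCOCO captioning and won the 2017 VQA Challenge.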

ABSTRACT

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

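Motivation and objectives

Direct attention to image content at the level of objects and salient regions rather than a fixed grid.
Develop a bottom-up attention mechanism that proposes region-based features via Faster R-CNN.
Integrate bottom-up regions with a top-down attention mechanism to improve captioning and VQA performance.
Show that region-based attention yields improvements across the standard evaluation metrics.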

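Proposed method

Define the image features V as the set of region features produced by a bottom-up Faster R-CNN with a ResNet-101 backbone, keeping the regions whose objectness score exceeds a threshold.
Use a top-down attention mechanism that computes attention weights over V, conditioned on the task context (captioning or VQA).
For captioning, use two LSTMs, one for top-down attention and one as the language model, together with soft attention over V.
For VQA, implement a joint multimodal embedding that uses the attention-weighted image features, and predict the answer over a fixed vocabulary.
Train with a cross-entropy loss, then refine with Self-Critical Sequence Training (SCST) to optimize the CIDEr score.
Optionally compare against a ResNet baseline to quantify the gain from bottom-up attention.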
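
The first two steps above (thresholding regions, then weighting them by a task context) can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the additive scoring form `w_a · tanh(W_v v_i + W_h h)`, the function names, and the tiny hand-written weight matrices are all assumptions made for the example. A real model would learn `W_v`, `W_h`, and `w_a` and take `h` from an LSTM state or a question encoding.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def select_regions(features: List[Vector], scores: List[float],
                   threshold: float) -> List[Vector]:
    """Bottom-up step: keep region features whose score exceeds the threshold."""
    return [f for f, s in zip(features, scores) if s > threshold]


def matvec(matrix: Matrix, vec: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_down_attention(regions: List[Vector], context: Vector,
                       W_v: Matrix, W_h: Matrix,
                       w_a: Vector) -> Tuple[Vector, Vector]:
    """Top-down step: soft attention over region features given a context.

    Returns the attention weights (one per region, summing to 1) and the
    attended feature, i.e. the weighted sum of the region features.
    The additive tanh scoring used here is an assumed form.
    """
    if not regions:
        raise ValueError("at least one region is required")
    proj_h = matvec(W_h, context)  # the context projection is shared by all regions
    logits = []
    for v in regions:
        proj_v = matvec(W_v, v)
        hidden = [math.tanh(a + b) for a, b in zip(proj_v, proj_h)]
        logits.append(sum(w * h for w, h in zip(w_a, hidden)))
    weights = softmax(logits)
    dim = len(regions[0])
    attended = [sum(a * v[d] for a, v in zip(weights, regions))
                for d in range(dim)]
    return weights, attended


if __name__ == "__main__":
    # Toy inputs: three 2-D region features with made-up detector scores.
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    scores = [0.9, 0.1, 0.5]
    V = select_regions(feats, scores, threshold=0.2)
    weights, attended = top_down_attention(
        V, context=[0.2, 0.4],
        W_v=[[1.0, 0.0], [0.0, 1.0]],
        W_h=[[0.5, 0.5], [0.5, -0.5]],
        w_a=[1.0, 1.0],
    )
    print("attention weights:", weights)
    print("attended feature:", attended)
```

Because the number of regions kept by `select_regions` varies per image, the softmax is taken over however many regions survive the threshold, which is the main practical difference from attending over a fixed grid of cells.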

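Experimental results

Research questions

RQ1: How does bottom-up, region-based attention affect image captioning quality compared with grid-based attention?
RQ2: Can the same bottom-up attention framework improve Visual Question Answering performance?
RQ3: How does object-level attention in captioning and VQA affect the identification of objects, attributes, and relationships?

Key findings

Bottom-up attention brings substantial improvements in image captioning on MSCOCO across metrics such as CIDEr, SPICE, and BLEU-4, achieving state-of-the-art results.
On the MSCOCO Karpathy test split, Up-Down (with bottom-up attention) outperforms the ResNet baseline by 3–8% on each metric.
The VQA model took first place in the 2017 VQA Challenge, with an overall accuracy of 70.3% on the VQA v2.0 test-standard server.
Qualitative attention visualizations show that the model attends to both fine details and large regions, improving word-level grounding.
Compared with the ResNet baseline, the Up-Down model improves on the Yes/No, Number, and Other question types on the VQA v2.0 validation and test sets.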

This review was generated by AI and checked by a human editor.