QUICK REVIEW

[Paper Review] MAttNet: Modular Attention Network for Referring Expression Comprehension

Licheng Yu, Zhe Lin|arXiv (Cornell University)|Jan 24, 2018

Multimodal Machine Learning Applications | References: 27 | Citations: 89

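One-Line Summary

MAttNet decomposes referring expressions into subject, location, and relationship modules and, using language-guided attention and visual attention, achieves state-of-the-art bounding-box-level and pixel-level comprehension without an external parser.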

ABSTRACT

In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. Demo and code are provided.

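Motivation and Objectives

Improve referring expression comprehension with a modular network that copes with the variation among expressions.
Remove the dependence on external language parsers by learning to softly parse expressions into modules.
Achieve high localization and segmentation accuracy through module-specific visual attention and adaptive combination of module scores.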

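Proposed Method

Decompose the expression into three phrase embeddings, one each for the subject, location, and relationship modules.
Use a language attention network to learn the module weights and the word/phrase attention without an external parser.
Run the three visual modules with different attention mechanisms (subject: soft attention inside the box; relationship: hard attention outside the box).
Compute a module-specific score S(o|q) for each module and aggregate the three scores with the learned module weights w_subj, w_loc, w_rel to obtain the overall score S(o|r).
Train end to end with a ranking loss over positive/negative pairs; the subject branch additionally incorporates attribute prediction.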
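
The score aggregation and ranking loss described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: normalizing the module weights with a softmax, the margin of 0.1, and the loss weights `lam1` and `lam2` are assumptions made here, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def overall_score(module_scores, module_logits):
    """Combine per-module scores S(o|q_m) into an overall score S(o|r).

    module_scores: (subj, loc, rel) matching scores for one candidate object.
    module_logits: unnormalized module weights predicted from the expression
                   (assumption: they are normalized with a softmax).
    """
    weights = softmax(module_logits)  # w_subj, w_loc, w_rel, summing to 1
    return sum(w * s for w, s in zip(weights, module_scores))


def ranking_loss(pos, neg_obj, neg_expr, margin=0.1, lam1=1.0, lam2=1.0):
    """Hinge ranking loss for one positive pair and two negative pairs.

    pos:      score of the matching (object, expression) pair
    neg_obj:  score of a wrong object paired with the same expression
    neg_expr: score of the same object paired with a wrong expression
    margin, lam1, lam2 are illustrative values, not taken from the paper.
    """
    return (lam1 * max(0.0, margin + neg_obj - pos)
            + lam2 * max(0.0, margin + neg_expr - pos))


# Usage: an expression whose weights favor the subject module.
logits = [2.0, 0.5, -1.0]
target = overall_score((0.9, 0.6, 0.1), logits)
distractor = overall_score((0.3, 0.7, 0.2), logits)
loss = ranking_loss(target, distractor, distractor)
```

Because the weights depend on the expression, the same three modules can rank candidates mostly by appearance for one expression and mostly by position or context for another.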

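Experimental Results

Research Questions

RQ1: Can a modular, end-to-end model improve referring expression comprehension without an external parser?
RQ2: How do subject, location, and relationship information each contribute to localization and segmentation performance?
RQ3: Does the learned language attention effectively route information to the appropriate visual module?
RQ4: How do in-box attention and out-of-box attention affect comprehension accuracy?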

Key Findings

Model	Backbone	Split	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	IoU
Matching:subj+loc	vgg16	val	63.15	63.53	59.87	-	-	56.51
MAttN:subj+loc	vgg16	val	63.07	65.04	61.77	-	-	56.51
MAttN:subj+loc(+dif)	vgg16	val	63.07	65.77	64.55	-	-	56.51
MAttN:subj+loc(+dif)+rel	vgg16	val	65.84	66.59	65.08	-	-	66.? (IoU shown in table)
MAttN:subj(+attr)+loc(+dif)+rel	vgg16	val	68.34	69.93	65.90	-	-	66.17
MAttN:subj(+attr+attn)+loc(+dif)+rel	vgg16	val	71.01	75.13	66.17	-	-	78.12
parser+MAttN:subj(+attr+attn)+loc(+dif)+rel	vgg16	val	66.08	68.30	62.94	-	-	73.72
MAttNet:subj+loc	res101-frcn	val	72.72	76.17	68.18	-	-	63.74
MAttNet:subj+loc(+dif)+rel	res101-frcn	val	73.25	76.77	68.44	-	-	64.01
MAttNet:subj(+attr)+loc(+dif)+rel	res101-frcn	val	74.51	77.81	68.39	-	-	65.19
MAttNet:subj(+attr+attn)+loc(+dif)+rel	res101-frcn	val	76.40	80.43	69.28	-	-	67.01
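
For readers unfamiliar with the column headers, Pr@X and IoU are commonly computed for segmentation masks as sketched below. This sketch rests on assumptions: Pr@X is taken to be the percentage of expressions whose predicted mask has IoU strictly above X, and IoU is taken to be the total intersection divided by the total union accumulated over the split. The review does not state which convention the table uses, and the function names are hypothetical.

```python
def mask_counts(pred, gt):
    """Return (intersection, union) pixel counts for two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def evaluate(pairs, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute Pr@X and overall IoU over a non-empty list of mask pairs.

    pairs: list of (pred_mask, gt_mask), each a flat sequence of 0/1 values.
    Returns (precision, overall_iou), both expressed in percent.
    """
    ious = []
    total_inter = 0
    total_union = 0
    for pred, gt in pairs:
        inter, union = mask_counts(pred, gt)
        ious.append(inter / union if union else 0.0)
        total_inter += inter
        total_union += union
    n = len(ious)
    # Assumption: a prediction counts as correct when IoU is strictly above X.
    precision = {t: 100.0 * sum(iou > t for iou in ious) / n for t in thresholds}
    overall_iou = 100.0 * total_inter / total_union if total_union else 0.0
    return precision, overall_iou


# Usage: one perfect mask and one mask covering a third of the target.
pairs = [([1, 1, 0, 0], [1, 1, 0, 0]),
         ([1, 0, 0, 0], [1, 1, 1, 0])]
precision, overall_iou = evaluate(pairs)
```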

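MAttNet substantially outperforms previous state-of-the-art methods on both bounding-box localization and pixel-level segmentation.
End-to-end training with soft language parsing and adaptive module weights yields large improvements over single-model baselines.
The subject module, with attribute-aware, phrase-guided in-box attention, improves accuracy especially for appearance-focused expressions.
The relationship module, with out-of-box attention and MIL-style max pooling, strengthens the handling of relationships between objects.
Fully automatic comprehension using detected proposals from Faster R-CNN / Mask R-CNN retains strong improvements across datasets.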
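
The MIL-style max pooling mentioned for the relationship module can be illustrated as follows. This is a rough sketch under stated assumptions: cosine similarity stands in for whatever matching function the model actually uses, the feature vectors are toy values, and `relationship_score` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def relationship_score(context_feats, phrase_emb):
    """Score a candidate by its best-matching surrounding object.

    context_feats: feature vectors of objects outside the candidate box.
    phrase_emb:    embedding of the relationship phrase.
    Returns (score, index of the selected context object); (0.0, None)
    when the candidate has no context objects.
    """
    if not context_feats:
        return 0.0, None
    scores = [cosine(f, phrase_emb) for f in context_feats]
    # Max pooling: only the highest-scoring context object contributes,
    # so no annotation of which object the phrase refers to is needed.
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


# Usage: the second context object matches the phrase embedding.
score, picked = relationship_score([[1.0, 0.0], [0.0, 1.0]], [0.0, 1.0])
```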


This review was generated by AI and checked by a human editor.