QUICK REVIEW

[Paper Review] Dynamic Graph Attention for Referring Expression Comprehension

Sibei Yang, Guanbin Li|arXiv (Cornell University)|Sep 18, 2019

Multimodal Machine Learning Applications | References: 31 | Citations: 24

Quick Summary

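This paper proposes Dynamic Graph Attention (DGA), a new method for referring expression comprehension that performs multi-step, language-guided visual reasoning over a dynamic graph of image objects and their relationships. By modeling the linguistic structure of the expression with a differentiable analyzer and updating compound object representations through graph propagation, DGA achieves state-of-the-art performance on three benchmarks and produces interpretable, stepwise reasoning paths for complex expressions.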

ABSTRACT

Referring expression comprehension aims to locate the object instance described by a natural language referring expression in an image. This task is compositional and inherently requires visual reasoning on top of the relationships among the objects in the image. Meanwhile, the visual reasoning process is guided by the linguistic structure of the referring expression. However, existing approaches treat the objects in isolation or only explore the first-order relationships between objects without being aligned with the potential complexity of the expression. Thus it is hard for them to adapt to the grounding of complex referring expressions. In this paper, we explore the problem of referring expression comprehension from the perspective of language-driven visual reasoning, and propose a dynamic graph attention network to perform multi-step reasoning by modeling both the relationships among the objects in the image and the linguistic structure of the expression. In particular, we construct a graph for the image with the nodes and edges corresponding to the objects and their relationships respectively, propose a differential analyzer to predict a language-guided visual reasoning process, and perform stepwise reasoning on top of the graph to update the compound object representation at every node. Experimental results demonstrate that the proposed method can not only significantly surpass all existing state-of-the-art algorithms across three common benchmark datasets, but also generate interpretable visual evidences for stepwisely locating the objects referred to in complex language descriptions.

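Motivation and Objectives

Address the limitation that existing referring expression comprehension models lack explicit multi-step reasoning and interpretability.
Improve the grounding of complex referring expressions by integrating the visual relationships among objects with the linguistic structure of the expression.
Enable high-level compositional reasoning by modeling both language syntax and the visual graph structure within a unified framework.
Produce interpretable, stepwise reasoning that visualizes, one step at a time, the visual evidence behind each grounding decision.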

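Proposed Method

Construct a directed visual graph in which nodes represent detected objects and edges represent the relationships between them.
Introduce a differentiable analyzer (the "differential analyzer" of the abstract) that decomposes the referring expression step by step into its constituent expressions.
Perform iterative, language-guided reasoning with dynamic graph attention, updating the compound object representation at every node.
At each reasoning step, use soft attention over words, nodes, and relationships to highlight the relevant linguistic and visual components.
Train end-to-end with a matching loss to learn a joint representation that aligns the expression with the final object representations.
Adopt a multi-step reasoning mechanism that propagates attention across the whole graph under linguistic guidance, enabling higher-order reasoning.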
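The steps listed above can be illustrated with a small toy loop. This is only a sketch under stated assumptions, not the authors' implementation: the function name `reason`, the per-step `step_queries` (a stand-in for the learned analyzer), the dot-product scoring, and the additive node update are all hypothetical simplifications. The paper's model uses learned neural layers that this review does not describe in detail.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def reason(word_vecs, step_queries, node_feats, edges):
    """Toy multi-step, language-guided reasoning over a directed graph.

    word_vecs:    one embedding per word of the expression
    step_queries: one query vector per reasoning step (hypothetical stand-in
                  for the learned analyzer)
    node_feats:   one feature vector per detected object
    edges:        dict mapping (src, dst) -> relation feature vector
    All vectors are assumed to share the same dimension.
    """
    nodes = [list(f) for f in node_feats]
    edge_keys = list(edges)
    trace = []  # per-step attention, kept for inspection
    for query in step_queries:
        # Soft attention over words gives this step's language summary.
        word_attn = softmax([dot(query, w) for w in word_vecs])
        lang = weighted_sum(word_attn, word_vecs)
        # Soft attention over nodes and over relationships (edges).
        node_attn = softmax([dot(lang, n) for n in nodes])
        edge_attn = softmax([dot(lang, edges[k]) for k in edge_keys]) if edge_keys else []
        # Propagate: each destination node absorbs a gated copy of its source.
        updated = [list(n) for n in nodes]
        for (src, dst), a in zip(edge_keys, edge_attn):
            gate = a * node_attn[src]
            for d in range(len(nodes[dst])):
                updated[dst][d] += gate * nodes[src][d]
        nodes = updated
        trace.append({"words": word_attn, "nodes": node_attn,
                      "edges": dict(zip(edge_keys, edge_attn))})
    # Match the (mean-pooled) expression against the final node representations.
    uniform = [1.0 / len(word_vecs)] * len(word_vecs)
    expr = weighted_sum(uniform, word_vecs)
    scores = [dot(expr, n) for n in nodes]
    return scores, trace

# Made-up inputs: 2 words, 3 reasoning steps, 3 objects, 2 relationships.
words = [[1.0, 0.0], [0.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
objects = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.3]]
relations = {(0, 1): [1.0, 0.0], (1, 2): [0.0, 1.0]}
scores, trace = reason(words, queries, objects, relations)
best = max(range(len(scores)), key=scores.__getitem__)
```

Here `best` is the index of the highest-scoring object, and `trace` holds the word, node, and edge attention for each step, which is the kind of per-step evidence the review describes as interpretable.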

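Experimental Results

Research Questions

RQ1: Can a model perform multi-step visual reasoning that follows the linguistic structure of a complex referring expression?
RQ2: How does integrating object relationships into a dynamic graph improve grounding accuracy for complex expressions?
RQ3: Can visualizing the attention over words, nodes, and relationships at each step make the reasoning process interpretable?
RQ4: Does learning the linguistic structure analysis end-to-end improve performance compared with fixed or heuristic parsing?
RQ5: What is the optimal number of reasoning steps for effective and robust grounding in referring expression comprehension?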

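Key Findings

The proposed DGA model achieved state-of-the-art performance on all three benchmark datasets. On RefCOCO it reached 86.34% on val, 86.64% on testA, and 84.79% on testB.
On RefCOCO+, it achieved 73.56% on val, 78.31% on testA, and 68.15% on testB, outperforming all baselines.
On RefCOCOg, it achieved 80.21% on val and 80.26% on test, setting a new state of the art.
The ablation study found that three reasoning steps (DGA(3)) performed best, and that a fourth step introduced noise.
The variant equipped with a language parser (DGA*) performed worse than the full DGA, indicating the importance of learning the linguistic structure analysis end-to-end.
Qualitative results show that DGA produces interpretable attention maps over words, nodes, and relationships, visualizing the reasoning chain step by step.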
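The review does not say how the percentages above are computed. A common convention for grounding benchmarks of this kind is to count a prediction as correct when its box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. The sketch below assumes that convention; the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 default are illustrative assumptions, not details taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predicted, ground_truth, threshold=0.5):
    """Fraction of expressions whose predicted box meets the IoU threshold.

    The 0.5 threshold is an assumed convention, not stated in the review.
    """
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)

# Made-up boxes: the first prediction matches exactly, the second overlaps too little.
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
truth = [(0, 0, 2, 2), (1, 0, 3, 2)]
accuracy = grounding_accuracy(preds, truth)
```

Under this assumption, a score such as 86.34% would mean that the predicted box met the overlap threshold for that share of the expressions in the split.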

This review was generated by AI and checked by a human editor.