QUICK REVIEW

[Paper Review] Scene Graph Generation via Conditional Random Fields

Weilin Cong, William Yang Wang | arXiv (Cornell University) | Nov 20, 2018

Multimodal Machine Learning Applications | References: 26 | Cited by: 18

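One-Sentence Summary

This paper proposes SG-CRF, a novel scene graph generation model that improves relationship prediction by modeling the order of subject and object and the semantic compatibility within a scene graph. Using conditional random fields (CRFs), it reports state-of-the-art Recall@100 of 49.95% on CLEVR, 50.47% on VRD, and 54.77% on Visual Genome.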

ABSTRACT

Despite the great success object detection and segmentation models have achieved in recognizing individual objects in images, performance on cognitive tasks such as image caption, semantic image retrieval, and visual QA is far from satisfactory. To achieve better performance on these cognitive tasks, merely recognizing individual object instances is insufficient. Instead, the interactions between object instances need to be captured in order to facilitate reasoning and understanding of the visual scenes in an image. Scene graph, a graph representation of images that captures object instances and their relationships, offers a comprehensive understanding of an image. However, existing techniques on scene graph generation fail to distinguish subjects and objects in the visual scenes of images and thus do not perform well with real-world datasets where exist ambiguous object instances. In this work, we propose a novel scene graph generation model for predicting object instances and its corresponding relationships in an image. Our model, SG-CRF, learns the sequential order of subject and object in a relationship triplet, and the semantic compatibility of object instance nodes and relationship nodes in a scene graph efficiently. Experiments empirically show that SG-CRF outperforms the state-of-the-art methods, on three different datasets, i.e., CLEVR, VRD, and Visual Genome, raising the Recall@100 from 24.99% to 49.95%, from 41.92% to 50.47%, and from 54.69% to 54.77%, respectively.

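Motivation and Objectives

To overcome the limitation that existing scene graph generation methods cannot distinguish subjects from objects in ambiguous real-world scenes.
To improve performance on cognitive vision tasks such as visual question answering, image captioning, and semantic image retrieval.
To model the order of subject and object in a relationship triplet more effectively than prior methods.
To improve the semantic compatibility between object instances and relationships in a scene graph, enabling better reasoning and understanding.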

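Proposed Method

SG-CRF adopts conditional random fields (CRFs) to model the order of subject and object in a relationship triplet.
The model explicitly learns the semantic compatibility between object nodes and relationship nodes in the scene graph.
Structural constraints are integrated into the CRF framework to ensure a plausible subject-object order in predicted relationships.
Object detection and relationship prediction are optimized jointly within a structured prediction framework.
A differentiable CRF layer enables end-to-end training by backpropagation.
To handle ambiguous object instances, the architecture is designed to favor triplets that are semantically compatible and correctly ordered.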
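
To make the idea of order-aware, compatibility-aware scoring concrete, here is a minimal sketch of a toy CRF over a single (subject, predicate, object) triplet. It is not the authors' model: the label sets, all score values, and the function names (`score`, `map_triplet`, `independent_triplet`, `probability`) are hypothetical, and exhaustive enumeration stands in for the differentiable CRF layer described above. The unary scores play the role of ambiguous detector outputs, while the two pairwise tables are order-sensitive: one scores (subject, predicate) pairs and the other scores (predicate, object) pairs.

```python
import math
from itertools import product

# Hypothetical toy label sets; not taken from the paper.
LABELS = ["person", "horse"]
PREDICATES = ["riding", "carrying"]

# Unary scores (higher is better). They are deliberately ambiguous:
# taken alone, they slightly prefer the wrong subject and object.
unary_subject = {"person": 0.5, "horse": 0.6}
unary_object = {"person": 0.6, "horse": 0.5}
unary_predicate = {"riding": 1.0, "carrying": 0.2}

# Order-sensitive pairwise compatibility scores.
# compat_sp scores (subject, predicate); compat_po scores (predicate, object).
compat_sp = {
    ("person", "riding"): 1.0, ("horse", "riding"): -1.0,
    ("person", "carrying"): 0.3, ("horse", "carrying"): 0.5,
}
compat_po = {
    ("riding", "horse"): 1.0, ("riding", "person"): -1.0,
    ("carrying", "person"): 0.5, ("carrying", "horse"): 0.1,
}


def score(s, p, o):
    """Unnormalized log-score of one triplet: unary plus pairwise terms."""
    return (unary_subject[s] + unary_predicate[p] + unary_object[o]
            + compat_sp[(s, p)] + compat_po[(p, o)])


def all_triplets():
    """Every (subject, predicate, object) assignment in the toy label space."""
    return product(LABELS, PREDICATES, LABELS)


def map_triplet():
    """Joint decoding: the triplet with the highest total score."""
    return max(all_triplets(), key=lambda t: score(*t))


def independent_triplet():
    """Baseline: pick each label from its unary scores alone."""
    return (max(unary_subject, key=unary_subject.get),
            max(unary_predicate, key=unary_predicate.get),
            max(unary_object, key=unary_object.get))


def probability(triplet):
    """Normalized probability of a triplet under the toy CRF."""
    z = sum(math.exp(score(*t)) for t in all_triplets())
    return math.exp(score(*triplet)) / z


print("joint:", map_triplet())
print("independent:", independent_triplet())
```

With these made-up scores, joint decoding selects (person, riding, horse), whereas choosing each label independently yields (horse, riding, person). The gap between the two illustrates why pairwise terms that depend on subject-object order can resolve ambiguity that unary scores alone cannot.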

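Experimental Results

Research Questions

RQ1: How can subject-object ambiguity in real-world images be resolved effectively in scene graph generation?
RQ2: Does modeling the order of subject and object improve relationship prediction performance?
RQ3: To what extent does enforcing semantic compatibility between objects and relationships improve scene graph quality?
RQ4: Does a structured prediction approach such as a CRF outperform autoregressive or independent prediction methods in scene graph generation?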

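Key Findings

On CLEVR, SG-CRF achieves a Recall@100 of 49.95%, a marked improvement over the previous state of the art of 24.99%.
On VRD, Recall@100 rises from 41.92% to 50.47%, indicating good generalization.
On Visual Genome, SG-CRF achieves a Recall@100 of 54.77%, narrowly exceeding the previous state of the art of 54.69%.
The performance gains are attributed to the model's ability to learn subject-object order and semantic compatibility effectively.
The results indicate that structured prediction with a CRF produces more consistent and accurate scene graphs than prior methods.
The method generalizes well across diverse datasets: synthetic (CLEVR), real-world (VRD), and complex (Visual Genome).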
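
The figures above are all Recall@100 values. As a rough illustration of how a Recall@K metric works, the sketch below keeps the K highest-scoring predicted triplets for one image and reports the fraction of ground-truth triplets found among them. This is a simplified assumption, not the paper's evaluation code: it matches triplets by exact label and instance identifier only, whereas scene graph benchmarks typically also require bounding-box overlap. The function name `recall_at_k` and the example data are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k=100):
    """Fraction of ground-truth triplets found among the top-k predictions.

    predictions: list of (confidence, triplet) pairs for one image.
    ground_truth: set of triplets annotated for the same image.
    A triplet is a (subject, predicate, object) tuple, so order matters.
    """
    if not ground_truth:
        return 0.0
    # Keep the k most confident predictions.
    top_k = sorted(predictions, key=lambda pair: pair[0], reverse=True)[:k]
    hits = {triplet for _, triplet in top_k} & set(ground_truth)
    return len(hits) / len(ground_truth)


# Hypothetical example data for a single image.
gt = {
    ("person#1", "riding", "horse#1"),
    ("horse#1", "on", "grass#1"),
}
preds = [
    (0.9, ("person#1", "riding", "horse#1")),
    (0.8, ("horse#1", "riding", "person#1")),  # swapped order: not a match
    (0.3, ("horse#1", "on", "grass#1")),
]

print(recall_at_k(preds, gt, k=2))
print(recall_at_k(preds, gt, k=3))
```

Because triplets are compared as ordered tuples, a prediction with subject and object swapped does not count as a hit, which is why a metric of this kind rewards models that get the subject-object order right.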

This review was generated by AI and checked by a human editor.