QUICK REVIEW

[Paper Review] Visual Relationship Detection with Language Priors

Cewu Lu, Ranjay Krishna|arXiv (Cornell University)|Jul 31, 2016

Multimodal Machine Learning Applications | References: 35 | Citations: 112

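One-Sentence Summary

This work proposes a scalable model that learns the visual appearance of objects and predicates and uses language priors to predict and localize thousands of visual relationships, enabling zero-shot detection and improved image retrieval.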

ABSTRACT

Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. "man riding bicycle" and "man pushing bicycle"). Consequently, the set of possible relationships is extremely large and it is difficult to obtain sufficient training examples for all possible relationships. Because of this limitation, previous work on visual relationship detection has concentrated on predicting only a handful of relationships. Though most relationships are infrequent, their objects (e.g. "man" and "bicycle") and predicates (e.g. "riding" and "pushing") independently occur more frequently. We propose a model that uses this insight to train visual models for objects and predicates individually and later combines them together to predict multiple relationships per image. We improve on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship. Our model can scale to predict thousands of types of relationships from a few examples. Additionally, we localize the objects in the predicted relationships as bounding boxes in the image. We further demonstrate that understanding relationships can improve content based image retrieval.

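Motivation and Objectives

Detect and localize a wide variety of visual relationships without bias toward a handful of common ones.
Propose a two-part model that learns the visual appearance of objects and predicates and fuses them to predict relationships.
Introduce a language embedding module based on word vectors that relates similar relationships to one another.
Demonstrate zero-shot visual relationship detection and show improved content-based image retrieval.
Provide a new dataset with thousands of relationship types for benchmarking visual relationship prediction.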

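Proposed Method

Train separate object and predicate detectors using CNNs (VGG) and RCNN proposals.
Model visual relationships as V(R) = P_i(O1) * (z_k^T CNN(O1,O2) + s_k) * P_j(O2), where P_i(O1) and P_j(O2) are the object-class likelihoods for the two boxes and z_k, s_k are the learned parameters for predicate k.
Project object pairs into a language embedding space with f(R) = w_k^T [word2vec(t_i), word2vec(t_j)] + b_k, where t_i and t_j are the words for the two object categories and w_k, b_k are learned per predicate.
Minimize the variance of distance-weighted embedding differences, K(W), so that semantically similar relationships receive similar scores.
Impose a ranking loss L(W) so that observed relationships rank above unobserved ones.
Combine the V, L, and K terms into the overall training objective, C + λ1 L + λ2 K, where C is the loss on the combined visual and language score and λ1, λ2 weight the other two terms.
At test time, score each object pair and select R* = argmax_R V(R,Θ|O1,O2) f(R,W).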
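
The scoring step can be illustrated with a small sketch. It is not the authors' implementation: the feature vector standing in for CNN(O1,O2), the two-dimensional "word vectors", and every parameter value below are made-up toy numbers, and the helper names (`visual_score`, `language_score`, `predict`) are hypothetical. For simplicity it assumes the object categories of the pair are already fixed and only the predicate k varies.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def visual_score(p_subj, p_obj, z_k, s_k, pair_feat):
    """V(R) = P_i(O1) * (z_k^T CNN(O1,O2) + s_k) * P_j(O2)."""
    return p_subj * (dot(z_k, pair_feat) + s_k) * p_obj


def language_score(w_k, b_k, vec_subj, vec_obj):
    """f(R) = w_k^T [word2vec(t_i), word2vec(t_j)] + b_k."""
    return dot(w_k, list(vec_subj) + list(vec_obj)) + b_k


def predict(p_subj, p_obj, pair_feat, vec_subj, vec_obj, predicates):
    """Return (predicate, score) maximizing V(R) * f(R) for one object pair."""
    scored = {
        name: visual_score(p_subj, p_obj, p["z"], p["s"], pair_feat)
        * language_score(p["w"], p["b"], vec_subj, vec_obj)
        for name, p in predicates.items()
    }
    best_name = max(scored, key=scored.get)
    return best_name, scored[best_name]


# Toy per-predicate parameters (hypothetical values, not learned).
PREDICATES = {
    "ride": {"z": [2.0, 0.0], "s": 0.0, "w": [1.0, 0.0, 0.0, 1.0], "b": 0.0},
    "push": {"z": [0.0, 1.0], "s": 0.0, "w": [0.5, 0.0, 0.0, 0.5], "b": 0.0},
}

best, score = predict(
    p_subj=0.9,            # stand-in for P_i(O1), e.g. "person"
    p_obj=0.8,             # stand-in for P_j(O2), e.g. "bicycle"
    pair_feat=[1.0, 0.5],  # stand-in for CNN(O1,O2)
    vec_subj=[1.0, 0.0],   # stand-in for word2vec("person")
    vec_obj=[0.0, 1.0],    # stand-in for word2vec("bicycle")
    predicates=PREDICATES,
)
print(best, round(score, 2))
```

Because the final score is a product, a predicate needs support from both the visual and the language term to win, which is the intuition behind using the language prior to adjust visual predictions.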

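Experimental Results

Research Questions

RQ1: Can visual relationships be detected by combining independently learned object and predicate appearance models with language priors?
RQ2: How do embedding-based language priors affect recognition, particularly for infrequent or unseen relationships?
RQ3: Does the proposed model scale to thousands of relationship types and support zero-shot learning?
RQ4: Does leveraging relationships improve image retrieval performance?
RQ5: How does the model perform relative to prior methods on the new large-scale visual relationship dataset?

Key Findings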

Phrase Det. R@100	Phrase Det. R@50	Relationship Det. R@100	Relationship Det. R@50	Predicate Det. R@100	Predicate Det. R@50
0.07	0.04	-	-	1.91	0.97
0.09	0.07	0.09	0.07	2.03	1.47
2.61	2.24	1.85	1.58	7.11	7.11
0.08	0.08	0.08	0.08	18.22	18.22
6.39	6.65	5.47	5.27	28.87	28.87
8.59	9.13	9.18	9.04	35.20	35.20
8.91	9.60	9.63	9.71	36.31	36.31
17.03	16.17	14.70	13.86	47.87	47.87
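
The R@100 and R@50 columns are recall-at-K scores. As a rough sketch of how such a number can be computed, the snippet below counts how many ground-truth triples appear among the top K predictions per image. It is a simplification under stated assumptions: it matches on the (subject, predicate, object) triple only and ignores bounding-box localization, and the function name `recall_at_k` and the toy data are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triples found in the top-k predictions per image.

    predictions:  {image_id: [(score, triple), ...]}
    ground_truth: {image_id: set of triples}
    """
    hits = 0
    total = 0
    for image_id, gt_triples in ground_truth.items():
        # Rank this image's predictions by confidence, highest first.
        ranked = sorted(predictions.get(image_id, []),
                        key=lambda p: p[0], reverse=True)
        top_k = {triple for _, triple in ranked[:k]}
        hits += len(gt_triples & top_k)
        total += len(gt_triples)
    return hits / total if total else 0.0


# Toy data (made up for illustration).
GROUND_TRUTH = {
    "img1": {("person", "ride", "bicycle"), ("person", "wear", "helmet")},
}
PREDICTIONS = {
    "img1": [
        (0.9, ("person", "ride", "bicycle")),
        (0.5, ("person", "push", "bicycle")),
        (0.4, ("person", "wear", "helmet")),
    ],
}

print(recall_at_k(PREDICTIONS, GROUND_TRUTH, 1))
print(recall_at_k(PREDICTIONS, GROUND_TRUTH, 3))
```

The function returns a fraction between 0 and 1; multiply by 100 to compare with percentage-style figures.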

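The full model (V + L + K) substantially outperforms prior methods on the new dataset in phrase detection, relationship detection, and predicate detection.
Language priors combined with the similarity embedding (the K term) improve zero-shot visual relationship detection.
Language priors let the model scale from a few examples to thousands of relationships and make zero-shot prediction possible at evaluation time.
On the Visual Phrases dataset, the full model achieves higher mAP and stronger recall, showing the benefit of embedding-based priors.
Using predicted relationships improves content-based image retrieval, with higher Recall@1 and a lower median rank than the baseline.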
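
For the retrieval result, Recall@1 and median rank can both be derived from the rank at which the correct image is returned for each query. The sketch below assumes 1-indexed ranks have already been collected; the function name `retrieval_metrics` and the example ranks are hypothetical and are not taken from the paper.

```python
from statistics import median


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Return ({k: recall@k}, median rank) from 1-indexed ranks of the correct image."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    # recall@k: share of queries whose correct image is ranked k or better.
    recall = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return recall, median(ranks)


# Made-up ranks for five queries.
recall, med = retrieval_metrics([1, 3, 1, 12, 2])
print(recall, med)
```

A higher Recall@1 means the correct image is more often the top result, while a lower median rank means the correct image typically appears nearer the top of the list.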

This review was generated by AI and checked by a human editor.