QUICK REVIEW

[Paper Review] Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

Ruichi Yu, Ang Li|arXiv (Cornell University)|Jul 28, 2017

Multimodal Machine Learning Applications | References: 28 | Citations: 57

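One-Sentence Summary

This paper proposes a teacher-student framework that distills internal and external linguistic knowledge into a visual relationship detector to improve predicate prediction. The gains are largest in the zero-shot setting.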

ABSTRACT

Understanding visual relationships involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the (subj,obj) pair (both semantically and spatially) to predict the predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships, but complicates learning since the semantic space of visual relationships is huge and the training data is limited, especially for the long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a (subj,obj) pair. Then, we distill the knowledge into a deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on VRD zero-shot testing set).

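Motivation and Objectives

Jointly model the ⟨subject, predicate, object⟩ triplet to capture and predict visual relationships.
Regularize the deep visual model with linguistic knowledge to handle long-tail and unseen relationships.
Exploit internal (training annotations) and external (publicly available text) linguistic statistics through knowledge distillation.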

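Proposed Method

Model the predicate jointly with the subject and object representations and their spatial configuration.
Build a teacher network from the linguistic knowledge P(pred|subj,obj) and distill it into a student network during training.
Collect linguistic knowledge from the training annotations and from Wikipedia, and combine the two to form the teacher's guidance.
Condition the predicate probabilities on semantic embeddings of the subject and object and on spatial features.
Train end to end with a loss that blends ground-truth supervision with the teacher's guidance (similar to a KL-style distillation loss).
Evaluate with Recall@k on the VRD and Visual Genome datasets, including zero-shot splits.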
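The steps above can be illustrated with a minimal sketch in plain Python. It estimates P(pred|subj,obj) from counted triplets, forms teacher soft labels, and blends ground-truth and teacher supervision into one loss. Everything named here is hypothetical: the function names, the additive smoothing, the teacher rule (student output re-weighted by the prior raised to `strength`), and the `balance` weight are assumptions made for illustration, since the review does not specify the exact formulas.

```python
import math
from collections import Counter, defaultdict


def conditional_predicate_prior(triplets, predicates, smoothing=1e-3):
    """Estimate P(pred | subj, obj) from (subj, pred, obj) triplets.

    Additive smoothing is an assumption of this sketch, not from the review.
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in triplets:
        counts[(subj, obj)][pred] += 1
    prior = {}
    for pair, pred_counts in counts.items():
        total = sum(pred_counts.values()) + smoothing * len(predicates)
        prior[pair] = [(pred_counts[p] + smoothing) / total for p in predicates]
    return prior


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def build_teacher(student_probs, prior_probs, strength=1.0):
    """Assumed teacher rule: re-weight the student output by prior ** strength."""
    weights = [s * (p ** strength) for s, p in zip(student_probs, prior_probs)]
    norm = sum(weights)
    if norm == 0.0:
        return list(student_probs)  # fall back when the prior gives no support
    return [w / norm for w in weights]


def distillation_loss(student_logits, teacher_probs, true_index, balance=0.5):
    """Blend cross-entropy on the true label with cross-entropy on teacher soft labels."""
    eps = 1e-12
    student_probs = softmax(student_logits)
    ce_truth = -math.log(student_probs[true_index] + eps)
    ce_teacher = -sum(
        t * math.log(s + eps) for t, s in zip(teacher_probs, student_probs)
    )
    return balance * ce_truth + (1.0 - balance) * ce_teacher
```

In this sketch an internal prior would come from the training triplets and an external one from triplets parsed out of text; how the two are merged is left open here, as the review only says they are combined.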

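Experimental Results

Research Questions

RQ1: Do linguistic statistics (internal and external) regularize deep visual relationship models and improve generalization?
RQ2: How does combining the teacher and student networks affect performance on seen and zero-shot relationships?
RQ3: How do semantic and spatial representations affect predicate prediction accuracy?
RQ4: Do external knowledge sources (e.g., Wikipedia) help or hurt when integrated with the internal training data?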

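Key Findings

Linguistic knowledge (LK) distillation substantially improves predicate prediction over a purely data-driven baseline.
Zero-shot recall on VRD improves from 8.45% to 19.17% with LK distillation.
Combining teacher and student predictions (T+S) gives the best results, outperforming the baseline in both seen and zero-shot settings.
Using semantic representations of the subject and object together with spatial features improves predictive power and generalization.
External knowledge alone can be noisy, but LK distillation remains beneficial when it is combined with internal knowledge and visual data.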
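To make the evaluation terms concrete, here is a small sketch of a T+S combination and a per-image Recall@k. Both are simplified assumptions: the review does not say how teacher and student outputs are merged, so a weighted average stands in for it, and the data layout (`predictions`, `ground_truth`) and function names are hypothetical.

```python
def combine_teacher_student(teacher_probs, student_probs, weight=0.5):
    """Weighted average of teacher and student predicate distributions (assumed rule)."""
    return [
        weight * t + (1.0 - weight) * s
        for t, s in zip(teacher_probs, student_probs)
    ]


def recall_at_k(images, k):
    """Fraction of ground-truth relationships found in each image's top-k predictions.

    Each image is a dict with:
      'predictions': list of (pair_id, predicate, score)
      'ground_truth': set of (pair_id, predicate)
    """
    hits = 0
    total = 0
    for image in images:
        # Keep only the k highest-scoring predictions for this image.
        ranked = sorted(image["predictions"], key=lambda p: p[2], reverse=True)[:k]
        top = {(pair_id, predicate) for pair_id, predicate, _ in ranked}
        hits += len(top & image["ground_truth"])
        total += len(image["ground_truth"])
    return hits / total if total else 0.0
```

A zero-shot evaluation under this sketch would pass only the images' ground-truth relationships whose triplets never occur in training, leaving the ranking step unchanged.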

This review was generated by AI and checked by a human editor.