QUICK REVIEW

[Paper Review] VD-BERT: A Unified Vision and Dialog Transformer with BERT

Yue Wang, Shafiq Joty|arXiv (Cornell University)|Apr 28, 2020

Multimodal Machine Learning Applications | References: 60 | Citations: 30

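One-line summary

VD-BERT introduces a single-stream vision-dialog Transformer built on BERT that jointly models image content and multi-turn dialog, achieving state-of-the-art NDCG on VisDial without pretraining on external vision-language data.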

ABSTRACT

Visual dialog is a challenging vision-language task, where a dialog agent needs to answer a series of questions through reasoning on the image content and dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work, we propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages the pretrained BERT language models for Visual Dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without the need of pretraining on external vision-language data, our model yields new state of the art, achieving the top position in both single-model and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog leaderboard. Our code and pretrained models are released at https://github.com/salesforce/VD-BERT.

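Motivation and objectives

Motivate visual dialog as a multi-turn reasoning task that requires integrating image content with dialog history.
Propose a unified Transformer model that handles both the discriminative (answer ranking) and generative (answer generation) Visual Dialog tasks.
Show that visually grounded training with BERT can produce state-of-the-art results without large-scale external vision-language pretraining.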

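Proposed method

Encode the image as object-level features and fuse them with the caption and the multi-turn dialog in a single Transformer encoder initialized from BERT.
Train with visually grounded objectives (Masked Language Modeling and Next Sentence Prediction), using two self-attention masks (bidirectional and seq2seq) to enable both the discriminative and generative settings.
Append each answer candidate to the input so that it is fused early with the other entities in the sequence.
In the discriminative setting, rank candidates by their NSP scores; in the generative setting, generate answers autoregressively with the same encoder under the appropriate masking.
Fine-tune on the dense relevance annotations with a ranking loss (ListNet) to improve ranking quality.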
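
The two self-attention masks named in the method list can be illustrated with a minimal sketch. It assumes a UniLM-style convention that the review itself does not spell out: the input is split into a context part (image objects, caption, dialog history, question) and an answer part, and a mask entry of 1 means "row position may attend to column position". The function names `bidirectional_mask` and `seq2seq_mask` are hypothetical and are not taken from the VD-BERT code base.

```python
def bidirectional_mask(seq_len):
    """Discriminative setting (assumed): every token attends to every token."""
    return [[1] * seq_len for _ in range(seq_len)]


def seq2seq_mask(context_len, answer_len):
    """Generative setting (assumed, UniLM-style).

    Context tokens attend only to the context. Answer tokens attend to the
    whole context and to answer tokens up to and including themselves, which
    gives left-to-right (autoregressive) attention over the answer.
    """
    total = context_len + answer_len
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < context_len:
                mask[i][j] = 1  # the context is visible to every position
            elif i >= context_len and j <= i:
                mask[i][j] = 1  # an answer token sees itself and earlier answer tokens
    return mask


if __name__ == "__main__":
    # Toy sizes: 3 context tokens followed by 2 answer tokens.
    for row in seq2seq_mask(context_len=3, answer_len=2):
        print(row)
```

Because only the mask changes between the two settings, the same encoder weights can serve both ranking and generation. In practice the 0/1 mask is usually converted into an additive mask (0 for allowed, a large negative number for blocked) before the softmax over attention scores.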

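Experimental results

Research questions

RQ1: Can a single unified Transformer encoder effectively model the bidirectional interactions among image objects, dialog history, and candidate answers in visual dialog?
RQ2: Can a BERT-based model be trained for both the discriminative (ranking) and generative (generation) VisDial tasks without a separate decoder or external vision-language pretraining?
RQ3: How do the visually grounded MLM and NSP objectives affect the fusion of the vision and dialog modalities?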

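Key findings

VD-BERT sets a new state of the art on VisDial v1.0 test-std in the single-model setting (NDCG 74.54) and in the ensemble setting (NDCG 75.35).
VD-BERT outperforms prior single-model baselines on the discriminative task and delivers competitive generative results without external vision-language pretraining.
Fine-tuning on the dense annotations raises NDCG substantially (for example, from 59.96 to 74.54) but can lower other metrics such as MRR and R@k, which points to an inconsistency between metrics.
Initializing from BERT yields a large gain over training from scratch, and visual grounding through MLM is important for multimodal transfer.
A unified Transformer with two self-attention masks can support both the discriminative and generative VisDial settings without an explicit decoder.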
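
The tension between NDCG and MRR noted in the findings can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's training or evaluation code: `listnet_loss` is the common top-one form of ListNet (cross-entropy between the softmax of the relevance labels and the softmax of the model scores), `ndcg` defaults `k` to the number of candidates with non-zero relevance (assumed here to match the VisDial convention), and all numbers are invented toy values.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(scores, relevance):
    """Top-one ListNet: cross-entropy between softmax(relevance) and softmax(scores)."""
    target = softmax(relevance)
    pred = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, pred))


def ndcg(scores, relevance, k=None):
    """NDCG@k over the candidates sorted by model score."""
    if k is None:
        k = sum(1 for r in relevance if r > 0)  # assumed VisDial-style choice of k
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ideal = sorted(relevance, reverse=True)
    dcg = sum(relevance[i] / math.log2(rank + 2) for rank, i in enumerate(order[:k]))
    idcg = sum(r / math.log2(rank + 2) for rank, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0


def reciprocal_rank(scores, gt_index):
    """1 / rank of the single ground-truth answer (averaged over a dataset, this is MRR)."""
    rank = 1 + sum(1 for s in scores if s > scores[gt_index])
    return 1.0 / rank


# Toy example: candidate 0 is the ground-truth answer, but annotators gave
# candidates 1 and 2 higher dense relevance.
relevance = [0.4, 1.0, 0.8, 0.0]
scores_gt_first = [3.0, 2.0, 1.0, 0.0]     # ranks the ground truth on top
scores_dense_first = [1.0, 3.0, 2.0, 0.0]  # follows the dense relevance order

if __name__ == "__main__":
    for name, s in [("gt_first", scores_gt_first), ("dense_first", scores_dense_first)]:
        print(name, ndcg(s, relevance), reciprocal_rank(s, 0), listnet_loss(s, relevance))
```

In this toy case the ranking that follows the dense relevance labels has the lower ListNet loss and the higher NDCG, yet it pushes the ground-truth answer down to rank 3, so its reciprocal rank drops. That is the same direction of trade-off the review reports after dense-annotation fine-tuning.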

This review was generated by AI and checked by a human editor.