QUICK REVIEW

[Paper Review] MERLOT: Multimodal Neural Script Knowledge Models

Rowan Zellers, Ximing Lu|arXiv (Cornell University)|Jun 4, 2021

Multimodal Machine Learning Applications|References 125|Citations 54

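One-Line Summary

MERLOT learns multimodal script knowledge from 6M YouTube videos using self-supervised objectives, achieves state-of-the-art results on 12 video QA tasks, and transfers to static images.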

ABSTRACT

As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes). Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.

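Motivation and Objectives

Motivate learning commonsense knowledge, temporal reasoning, and multimodal world knowledge from unlabeled video and speech data.
Develop a self-supervised pretraining framework that aligns video frames with transcripts and contextualizes them over time.
Evaluate transfer to downstream vision-language tasks, including video QA and static-image reasoning.
Release a diverse YouTube-derived corpus and open-source models to enable research on multimodal temporal reasoning.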

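Proposed Method

Pretrain MERLOT on consecutive segments from 6 million unlabeled YouTube videos. Each segment pairs one frame with a span of transcribed speech.
Use a joint vision-language transformer to encode frames and transcript segments and to learn cross-modal representations.
Three pretraining objectives: (i) contrastive frame-transcript matching, which aligns each frame with its contextualized transcript; (ii) masked language modeling with attention-based masking, which reconstructs visually grounded words; (iii) temporal reordering, which learns the chronological order of events.
Train on YT-Temporal-180M, a diverse corpus; use a grid-based image encoder (ResNet-50 + Vision Transformer) and a 12-layer RoBERTa-style joint encoder.
Finetune across 14 downstream datasets, including 12 video reasoning benchmarks and VCR (Visual Commonsense Reasoning).
Release code, data, and models for research use.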
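As a rough illustration of objective (i), the sketch below computes a symmetric contrastive loss over a batch of frame and transcript embeddings, where the i-th frame and the i-th transcript segment form the positive pair. This is a minimal NumPy sketch, not the authors' implementation: the function name `frame_transcript_contrastive_loss`, the temperature value, and the embedding shapes are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def frame_transcript_contrastive_loss(frame_emb, text_emb, temperature=0.05):
    """Symmetric contrastive loss for N aligned (frame, transcript) pairs.

    frame_emb, text_emb: arrays of shape (N, D); row i of each is a positive pair.
    temperature: assumed value for this sketch, not taken from the review.
    """
    # L2-normalize so the dot product is a cosine similarity.
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = f @ t.T / temperature  # (N, N); the diagonal holds the positive pairs
    idx = np.arange(len(f))

    # Frame -> transcript: each row is a classification over the N transcripts.
    loss_f2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Transcript -> frame: each column is a classification over the N frames.
    loss_t2f = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_f2t + loss_t2f)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    # Correctly aligned pairs should score a lower loss than mismatched pairs.
    print(frame_transcript_contrastive_loss(emb, emb))
    print(frame_transcript_contrastive_loss(emb, emb[::-1]))
```

The other frames and transcripts in the batch act as negatives, so the loss falls only when each frame is closer to its own transcript than to the rest.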

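Experimental Results

Research Questions

RQ1: Can multimodal script knowledge be learned from unlabeled videos and transcripts without manual annotation?
RQ2: Do frame-level and video-level objectives complement each other in improving temporal and multimodal reasoning?
RQ3: How well do representations learned from video transfer to static-image reasoning tasks that require temporal or narrative understanding?
RQ4: How much do data diversity, pretraining duration, and objective design affect downstream performance?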

Key Findings

Model	Spearman	Pairwise accuracy	Distance
MERLOT (base-sized)	0.733	84.5	0.498
CLIP	0.609	78.7	0.638
UNITER	0.545	75.2	0.745

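When finetuned, MERLOT achieves state-of-the-art results on 12 downstream video reasoning tasks.
On Visual Commonsense Reasoning (VCR), MERLOT reaches 80.6% accuracy, outperforming comparable baselines by more than 3%.
In Table 1, MERLOT (base-sized) reaches Spearman 0.733, pairwise accuracy 84.5, and distance 0.498, outperforming CLIP (0.609, 78.7, 0.638) and UNITER (0.545, 75.2, 0.745); a lower distance is better.
Pretraining on videos (rather than on images alone) and increasing video diversity both improve performance, and longer pretraining yields continued gains.
MERLOT transfers to static images and, on temporal commonsense tasks such as reordering visual stories, outperforms baselines that rely on image-caption pairs.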
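To make the three columns of the ordering table concrete, the sketch below scores a predicted ordering against the true one. It is an illustrative sketch only: Spearman correlation and pairwise accuracy follow their standard definitions for tie-free rankings, but the review does not define "Distance", so the mean absolute position displacement used here is an assumption, and the function name `ordering_metrics` is hypothetical.

```python
from itertools import combinations


def ordering_metrics(pred_positions):
    """Score a predicted ordering of n >= 2 items.

    pred_positions[i] is the predicted position (0-based) of the item whose
    true position is i, so a perfect prediction is [0, 1, ..., n - 1].
    Returns (spearman, pairwise_accuracy, distance).
    """
    n = len(pred_positions)

    # Spearman rank correlation for two tie-free rankings.
    d2 = sum((p - i) ** 2 for i, p in enumerate(pred_positions))
    spearman = 1 - 6 * d2 / (n * (n ** 2 - 1))

    # Fraction of item pairs whose relative order is predicted correctly.
    pairs = list(combinations(range(n), 2))
    correct = sum(pred_positions[i] < pred_positions[j] for i, j in pairs)
    pairwise_accuracy = correct / len(pairs)

    # ASSUMPTION: "distance" taken as the mean absolute position displacement.
    distance = sum(abs(p - i) for i, p in enumerate(pred_positions)) / n

    return spearman, pairwise_accuracy, distance


if __name__ == "__main__":
    # A five-frame story with the first two frames swapped.
    print(ordering_metrics([1, 0, 2, 3, 4]))
```

Under these definitions, higher Spearman and pairwise accuracy and a lower distance all indicate a better ordering, which matches the direction of the comparison in the table.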


This review was generated by AI and checked by a human editor.