QUICK REVIEW

[Paper Review] MERLOT: Multimodal Neural Script Knowledge Models

Rowan Zellers, Ximing Lu|arXiv (Cornell University)|2021. 06. 04.

Multimodal Machine Learning Applications | References: 125 | Citations: 54

One-Line Summary

MERLOT learns multimodal script knowledge from 6M YouTube videos using self-supervised objectives, achieves state-of-the-art results on 12 video QA tasks, and transfers to static images.

ABSTRACT

As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes). Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.

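Motivation and Goals

Learn commonsense, temporal reasoning, and multimodal world knowledge from unlabeled video and speech data.
Develop a self-supervised pretraining framework that aligns video frames with transcripts and contextualizes them over time.
Evaluate transferability to downstream vision-language tasks, including video QA and static image reasoning.
Release a diverse YouTube-based corpus and open-source models to enable research on multimodal temporal reasoning.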

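Proposed Method

Pretrain MERLOT on consecutive video segments from 6 million unlabeled YouTube videos; each segment pairs a frame with the spoken transcript for that segment.
Use a joint vision-language Transformer to encode frames and transcript segments and to learn cross-modal representations.
Three pretraining objectives: (i) contrastive frame-transcript matching, which aligns frames with contextualized transcripts; (ii) masked language modeling with attention-based masking, which reconstructs concrete words; (iii) temporal reordering, which learns the order of events over time.
Train on YT-Temporal-180M, a diverse corpus; use a grid-based image encoder (ResNet-50 + Vision Transformer) and a 12-layer RoBERTa-style joint encoder.
Finetune on downstream tasks across 14 datasets, including 12 video reasoning benchmarks and VCR (Visual Commonsense Reasoning).
Release code, data, and models for research use.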
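
Objective (i) can be illustrated with a small sketch. The review does not give the exact loss, so the code below assumes a common symmetric InfoNCE formulation: within a batch, each frame embedding should score highest against its own transcript embedding, and vice versa. The function names and the temperature value are hypothetical and are not taken from the paper or its released code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of scores."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def contrastive_matching_loss(frame_embs, text_embs, temperature=0.05):
    """Symmetric InfoNCE loss (assumed form, not the paper's exact loss).

    frame_embs[i] and text_embs[i] are treated as a matching pair; every
    other combination in the batch serves as a negative.
    The temperature of 0.05 is an illustrative assumption.
    """
    frames = [l2_normalize(v) for v in frame_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(frames)

    # sims[i][j]: scaled cosine similarity between frame i and transcript j.
    sims = [[sum(a * b for a, b in zip(f, t)) / temperature for t in texts]
            for f in frames]

    # Frame-to-transcript direction: the correct transcript is on the diagonal.
    frame_to_text = -sum(log_softmax(sims[i])[i] for i in range(n)) / n

    # Transcript-to-frame direction: same idea over the columns.
    columns = [[sims[i][j] for i in range(n)] for j in range(n)]
    text_to_frame = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (frame_to_text + text_to_frame)
```

With correctly paired embeddings the loss is close to zero; if the transcripts are swapped between frames, the loss grows, which is the signal that pushes matching frames and transcripts together.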

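Experimental Results

Research Questions

RQ1: Can multimodal script knowledge be learned from unlabeled videos and transcripts without manual annotation?
RQ2: Do frame-level and video-level objectives complement each other to improve temporal and multimodal reasoning?
RQ3: How well do representations learned from video transfer to static image reasoning tasks that require temporal or narrative understanding?
RQ4: How do data diversity, pretraining duration, and objective design affect downstream performance?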

Key Results

Model	Spearman	Pairwise acc	Distance
MERLOT (base-sized)	0.733	84.5	0.498
CLIP	0.609	78.7	0.638
UNITER	0.545	75.2	0.745
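
The three columns compare a predicted ordering against the true one. The review does not define them, so the sketch below assumes the usual definitions for ordering tasks: Spearman rank correlation, the fraction of item pairs placed in the correct relative order, and the mean absolute displacement of each item from its true position. Treating "Distance" as mean absolute displacement is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def spearman(pred_pos, gold_pos):
    """Spearman rank correlation for two tie-free position lists."""
    n = len(pred_pos)
    sq_diff = sum((p - g) ** 2 for p, g in zip(pred_pos, gold_pos))
    return 1 - 6 * sq_diff / (n * (n * n - 1))


def pairwise_accuracy(pred_pos, gold_pos):
    """Fraction of item pairs whose relative order matches the gold order."""
    pairs = list(combinations(range(len(pred_pos)), 2))
    correct = sum(
        (pred_pos[i] < pred_pos[j]) == (gold_pos[i] < gold_pos[j])
        for i, j in pairs
    )
    return correct / len(pairs)


def mean_abs_distance(pred_pos, gold_pos):
    """Average number of positions each item is away from its true slot."""
    return sum(abs(p - g) for p, g in zip(pred_pos, gold_pos)) / len(pred_pos)


# pred_pos[k] is the position the model assigned to item k.
gold = [0, 1, 2, 3, 4]
pred = [1, 0, 2, 3, 4]  # the first two items are swapped
print(spearman(pred, gold), pairwise_accuracy(pred, gold),
      mean_abs_distance(pred, gold))
```

Under these definitions a perfect ordering scores 1.0, 1.0, and 0.0, so higher is better for the first two columns and lower is better for the third. The table reports pairwise accuracy as a percentage rather than a fraction.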

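MERLOT achieves state-of-the-art results on 12 downstream video reasoning tasks when finetuned.
On Visual Commonsense Reasoning (VCR), MERLOT reaches 80.6% accuracy, outperforming state-of-the-art models of similar size by more than 3 percentage points.
In Table 1, MERLOT (base-sized) reaches Spearman 0.733, pairwise accuracy 84.5, and distance 0.498, ahead of CLIP (0.609, 78.7, 0.638) and UNITER (0.545, 75.2, 0.745) on all three metrics; for distance, the lower value is the better one.
Pretraining on videos (rather than images alone) and increasing video diversity both improve performance, and longer pretraining yields continued gains.
MERLOT transfers to static images and shows temporal commonsense reasoning on tasks such as unscrambling visual stories, where it outperforms baselines that rely on image-caption pairs.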
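
Unscrambling a visual story means recovering the original order of a shuffled set of images. The review does not say how an ordering is decoded from the model, so the following is only a sketch under an assumption: a temporal-ordering head supplies, for every pair of images, a probability that one precedes the other, and the full ordering is the permutation that best agrees with those pairwise probabilities. The function name and the input format are hypothetical.

```python
import math
from itertools import permutations


def best_order(pair_prob):
    """Return the permutation most consistent with pairwise order scores.

    pair_prob[i][j] is an assumed model output: the probability that image i
    comes before image j (values strictly between 0 and 1; the diagonal is
    never read). Exhaustive search is only practical for short stories,
    for example 5 images give 120 candidate orderings.
    """
    n = len(pair_prob)

    def score(order):
        # Sum of log-probabilities over every pair implied by this ordering.
        return sum(
            math.log(pair_prob[order[a]][order[b]])
            for a in range(n)
            for b in range(a + 1, n)
        )

    return max(permutations(range(n)), key=score)


# Toy scores whose most consistent reading is: image 2, then 0, then 1.
scores = [
    [0.5, 0.7, 0.1],
    [0.3, 0.5, 0.2],
    [0.9, 0.8, 0.5],
]
print(best_order(scores))
```

Scoring whole permutations rather than sorting greedily keeps the result well defined even when the pairwise predictions contradict one another.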


This review was generated by AI and reviewed by a human editor.