QUICK REVIEW

[Paper Review] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Hassan Akbari, Liangzhe Yuan|arXiv (Cornell University)|2021. 04. 22.

Human Pose and Action Recognition | References: 105 | Citations: 340

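One-Line Summary

VATT trains convolution-free Transformers on raw video, audio, and text using multimodal contrastive losses, and achieves state-of-the-art results on video action recognition and audio event classification without supervised pre-training.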

ABSTRACT

We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks. Especially, VATT's vision Transformer achieves the top-1 accuracy of 82.1% on Kinetics-400, 83.6% on Kinetics-600, 72.7% on Kinetics-700, and 41.1% on Moments in Time, new records while avoiding supervised pre-training. Transferring to image classification leads to 78.7% top-1 accuracy on ImageNet compared to 64.7% by training the same Transformer from scratch, showing the generalizability of our model despite the domain gap between videos and images. VATT's audio Transformer also sets a new record on waveform-based audio event recognition by achieving the mAP of 39.4% on AudioSet without any supervised pre-training. VATT's source code is publicly available.

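Research Motivation and Goals

Leverage large-scale multimodal video data to avoid dependence on labeled data.
Develop a convolution-free Transformer architecture that processes raw video, audio, and text inputs.
Propose a multimodal contrastive learning objective with a hierarchical common space for cross-modal alignment.
Evaluate the learned representations on video action recognition, audio event classification, image classification, and text-to-video retrieval.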

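Proposed Method

Uses modality-specific tokenization and separate positional encodings for video, audio, and text inputs.
Adopts a convolution-free Transformer backbone with an aggregation token that represents the whole sequence.
Introduces DropToken, which drops a random subset of tokens during training to reduce computation.
Builds a semantically hierarchical common space with projections g that align video, audio, and text through NCE and MIL-NCE losses.
Trains with the multimodal contrastive objective on HowTo100M (video-audio-text) and AudioSet (video-audio).
Optionally shares weights across modalities to form a modality-agnostic backbone (VATT-MA).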
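
As a rough illustration of the contrastive alignment described above, the following is a minimal sketch of an NCE-style loss between paired video and audio embeddings that already live in a common space. It is not the authors' implementation: the function names, the use of in-batch negatives, and the temperature value of 0.07 are assumptions made for the example, and the MIL-NCE variant (which allows several positive text candidates per clip) is not covered.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def nce_loss(video_embs, audio_embs, temperature=0.07):
    """NCE-style contrastive loss over a batch (hypothetical helper).

    Row i of video_embs and row i of audio_embs form the positive pair;
    every other audio embedding in the batch acts as a negative.
    The temperature is an illustrative assumption.
    """
    v = [l2_normalize(x) for x in video_embs]
    a = [l2_normalize(x) for x in audio_embs]
    total = 0.0
    for i in range(len(v)):
        logits = [dot(v[i], a_j) / temperature for a_j in a]
        # Numerically stable log-sum-exp over positive and negatives.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log( exp(positive) / sum(exp(all)) )
        total += log_denominator - logits[i]
    return total / len(v)


# Toy batch: two clips with 2-dimensional embeddings.
video = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[1.0, 0.0], [0.0, 1.0]]
audio_shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = nce_loss(video, audio_aligned)
loss_shuffled = nce_loss(video, audio_shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is small when each video embedding is closest to its own audio embedding and large when the pairs are mismatched, which is the behavior a contrastive objective is meant to reward.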

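Experimental Results

Research Questions

RQ1: Can a single convolution-free Transformer backbone learn from raw video, audio, and text through self-supervised multimodal objectives?
RQ2: Can a modality-agnostic Transformer perform on par with modality-specific backbones across tasks?
RQ3: How does DropToken affect training efficiency and downstream performance on high-resolution multimodal data?
RQ4: How well do VATT representations transfer to image classification and zero-shot text-to-video retrieval?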

Key Results

Method	Kinetics-400 Top-1	Kinetics-400 Top-5	Kinetics-600 Top-1	Kinetics-600 Top-5	Moments in Time Top-1	Moments in Time Top-5	TFLOPs
VATT-Base	79.6	94.9	80.5	95.5	38.7	67.5	9.09
VATT-Medium	81.1	95.6	82.4	96.1	39.5	68.2	15.02
VATT-Large	82.1	95.5	83.6	96.6	41.1	67.7	29.80
VATT-MA-Medium	79.9	94.9	80.8	95.5	37.8	65.9	15.02

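When fine-tuned without any supervised pre-training, VATT reaches 82.1% top-1 on Kinetics-400, 83.6% on Kinetics-600, and 41.1% on Moments in Time.
VATT's vision backbone, pre-trained on multimodal data, transfers to ImageNet with 78.7% top-1 accuracy, a level comparable to earlier supervised ViT variants and well above the 64.7% obtained by training the same Transformer from scratch.
The audio Transformer achieves a mean average precision (mAP) of 39.4% on AudioSet, outperforming CNN-based baselines.
Zero-shot text-to-video retrieval on YouCook2 and MSR-VTT, using VATT's video-text common space, is competitive with prior multimodal methods, and the results vary with batch size and number of epochs.
After fine-tuning, the modality-agnostic backbone (VATT-MA) performs comparably to modality-specific backbones on video action recognition, though slightly lower (for example, 79.9% vs. 81.1% top-1 on Kinetics-400 at the Medium size).
DropToken substantially reduces pre-training computation while maintaining downstream performance, which makes high-resolution inputs feasible.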
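
To make the DropToken idea more concrete, here is a minimal sketch of random token dropping together with a back-of-the-envelope estimate of how much the pairwise self-attention term shrinks. The function names, the 50% drop rate, and the choice to return the kept indices (so a caller could look up matching positional encodings) are assumptions for illustration, not details taken from the review.

```python
import random


def drop_token(tokens, drop_rate=0.5, rng=None):
    """Keep a random subset of tokens, preserving their original order.

    Hypothetical helper: returns the kept tokens and their indices.
    """
    rng = rng or random.Random()
    num_keep = max(1, round(len(tokens) * (1.0 - drop_rate)))
    keep_idx = sorted(rng.sample(range(len(tokens)), num_keep))
    return [tokens[i] for i in keep_idx], keep_idx


def attention_cost_ratio(drop_rate):
    """Relative cost of the pairwise-attention term after dropping tokens.

    Self-attention compares every token with every other token, so this
    term scales with the square of the sequence length.
    """
    return (1.0 - drop_rate) ** 2


# Toy sequence of 8 patch tokens, represented here by their indices.
tokens = list(range(8))
kept, kept_idx = drop_token(tokens, drop_rate=0.5, rng=random.Random(0))
print(len(kept), "of", len(tokens), "tokens kept")
print("attention cost ratio:", attention_cost_ratio(0.5))  # 0.25
```

Under these assumptions, dropping half of the tokens cuts the quadratic attention term to about a quarter, while the per-token feed-forward cost falls roughly in proportion to the number of tokens kept.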

This review was generated by AI and reviewed by a human editor.