QUICK REVIEW

[Paper Review] VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Sihan Chen, Handong Li|arXiv (Cornell University)|2023. 05. 29.

Multimodal Machine Learning Applications | Citations: 28

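One-Line Summary

This paper introduces VAST-27M, a large-scale omni-modality video caption dataset, and VAST, a foundation model that jointly models vision, audio, subtitles, and text; VAST achieves state-of-the-art results on vision-text, audio-text, and multi-modal video-text tasks.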

ABSTRACT

Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we resort to establish connections between multi-modality video tracks, including Vision, Audio, and Subtitle, and Text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA). Extensive experiments have been conducted to demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model and dataset will be released at https://github.com/TXH-mercury/VAST.

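Motivation and Objectives

Advance omni-modality video understanding by going beyond conventional vision-text models to use vision, audio, and subtitles.
Automate caption generation across vision, audio, and subtitles to build a scalable omni-modality caption dataset.
Train a unified foundation model that processes and fuses the four modalities, and apply it to a range of downstream tasks (retrieval, captioning, and question answering).
Show that omni-modality pretraining improves results on cross-modal benchmarks over existing cross-modality methods.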

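Proposed Method

A two-stage automatic pipeline builds VAST-27M: first train separate vision and audio captioners, then use an LLM to merge the single-modality captions and the subtitles into omni-modality captions.
VAST-27M comprises 27M video clips with 11 captions per clip (5 vision, 5 audio, 1 omni-modality).
VAST is a 1.3B-parameter Transformer-based foundation model with a ViT vision encoder, a BEATs audio encoder, a BERT text encoder, and cross-attention for fusion.
Training uses three omni-modality objectives: OM-VCC (contrastive learning), OM-VCM (matching), and OM-VCG (omni-modality caption generation).
Modality grouping during pretraining and fine-tuning handles modalities that are missing in downstream tasks.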
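
The review names OM-VCC only as a contrastive objective and does not give its formula. As a rough illustration, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired video and caption embeddings, which is a common way to implement such an objective. The function names, the temperature value, and the choice of symmetric InfoNCE are assumptions for illustration, not details confirmed by the paper or this review.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_row(row):
    """Numerically stable log-softmax over one row of similarity scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(video_embs, caption_embs, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings (assumed form, not the paper's).

    video_embs[i] and caption_embs[i] are treated as the positive pair;
    every other combination in the batch is treated as a negative.
    """
    v = [l2_normalize(e) for e in video_embs]
    c = [l2_normalize(e) for e in caption_embs]
    n = len(v)
    # Temperature-scaled cosine similarity between every video and every caption.
    sim = [[sum(a * b for a, b in zip(v[i], c[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Video-to-caption direction: each row should peak on its diagonal entry.
    v2c = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n
    # Caption-to-video direction: same computation on the transposed matrix.
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]
    c2v = -sum(log_softmax_row(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2c + c2v)
```

With correctly paired embeddings the loss is close to zero, and it grows when captions are matched to the wrong videos, which is the behavior a contrastive alignment objective is meant to reward.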

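Experimental Results

Research Questions

RQ1: Can an omni-modality video caption corpus improve cross-modal understanding beyond vision-text models?
RQ2: Does a unified vision-audio-subtitle-text foundation model generalize across retrieval, captioning, and QA tasks on diverse benchmarks?
RQ3: How do large-scale omni-modality pretraining and LLM-based caption integration affect downstream performance?
RQ4: How does VAST-27M compare with existing cross-modal corpora in quality and scale?
RQ5: What do ablation experiments reveal about the importance of each modality and of the omni-modality objectives?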

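Key Findings

VAST sets 22 new state-of-the-art results on cross-modal benchmarks.
VAST outperforms prior models on vision-text, audio-text, and multi-modal video-text tasks spanning retrieval, captioning, and QA.
Omni-modality pretraining on VAST-27M yields substantial gains over various open-source corpora in the V-T and A-T settings and improves OMV-OMC alignment.
Using an LLM to generate omni-modality captions from the single-modality captions performs better than simply concatenating the captions.
Comparisons on datasets such as MSRVTT, YouCook2, VATEX, and VALOR-32K show strong performance, often with sizable gains over SOTA baselines.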
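
To make the contrast between plain caption concatenation and LLM-based integration concrete, the sketch below shows what the inputs to each approach might look like for a single clip. Everything here is hypothetical: the `ClipCaptions` fields, the choice to concatenate one caption per modality, and the prompt wording are illustrative assumptions, and the review does not specify the actual prompt or baseline format used by the authors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipCaptions:
    """Single-modality text for one clip (hypothetical field names)."""
    vision_captions: List[str]
    audio_captions: List[str]
    subtitle: str = ""


def concat_baseline(clip: ClipCaptions) -> str:
    """Assumed baseline: join one vision caption, one audio caption, and the subtitle."""
    parts = clip.vision_captions[:1] + clip.audio_captions[:1]
    if clip.subtitle:
        parts.append(clip.subtitle)
    return " ".join(parts)


def build_integration_prompt(clip: ClipCaptions) -> str:
    """Assemble a hypothetical instruction prompt asking an LLM to merge the inputs."""
    lines = ["Merge the following descriptions of one video clip into a single caption."]
    lines += [f"Vision caption: {c}" for c in clip.vision_captions]
    lines += [f"Audio caption: {c}" for c in clip.audio_captions]
    if clip.subtitle:
        lines.append(f"Subtitle: {clip.subtitle}")
    return "\n".join(lines)
```

The baseline output is just the pieces placed side by side, whereas the prompt hands all of them to a language model that can resolve overlap and produce one fluent caption. The model call itself is left out because the review does not say which LLM interface was used.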


This review was generated by AI and checked by a human editor.