QUICK REVIEW

[Paper Review] VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Sihan Chen, Handong Li | arXiv (Cornell University) | May 29, 2023

Multimodal Machine Learning Applications | Cited by 28

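ONE-SENTENCE SUMMARY

This paper presents VAST-27M, a large-scale omni-modality video caption dataset, and VAST, a foundation model that jointly models vision, audio, subtitles, and text, achieving new state-of-the-art results on vision-text, audio-text, and multimodal video-text tasks.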

ABSTRACT

Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we resort to establish connections between multi-modality video tracks, including Vision, Audio, and Subtitle, and Text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA). Extensive experiments have been conducted to demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model and dataset will be released at https://github.com/TXH-mercury/VAST.

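RESEARCH MOTIVATION AND OBJECTIVES

Advance omni-modality video understanding beyond conventional vision-text models by exploiting vision, audio, and subtitles.
Create a scalable omni-modality caption dataset by automatically generating vision and audio captions.
Train a unified foundation model that processes and fuses four modalities to support a range of downstream tasks (retrieval, captioning, question answering).
Show that omni-modality pretraining outperforms prior cross-modality methods on cross-modality benchmarks.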

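PROPOSED METHOD

Build VAST-27M with a two-stage automated pipeline: train a vision captioner and an audio captioner separately, then use a large language model to generate omni-modality captions from the single-modality captions and the subtitles.
Construct VAST-27M from 27M video clips, each with 11 captions (5 vision, 5 audio, 1 omni-modality).
Propose VAST, a 1.3B-parameter Transformer-based foundation model with vision (ViT), audio (BEATs), and text (BERT) encoders, plus cross-attention for fusion.
Train with three omni-modality objectives: OM-VCC (contrastive), OM-VCM (matching), and OM-VCG (omni-modality caption generation).
Apply modality grouping during pretraining and fine-tuning to handle missing modalities in downstream tasks.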
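
The contrastive objective (OM-VCC) listed above can be illustrated with a generic symmetric InfoNCE loss over a batch of paired video and caption embeddings. This is a minimal sketch, not the authors' implementation: this review does not give the exact formulation, so the function names, the temperature value of 0.07, and the use of already-pooled embedding vectors are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_row(row):
    """Numerically stable log-softmax over one row of similarity scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(video_embs, caption_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (assumed stand-in for OM-VCC).

    video_embs[i] and caption_embs[i] are treated as a matching pair;
    every other caption in the batch acts as a negative for video i,
    and vice versa. The temperature is a hypothetical default.
    """
    v = [l2_normalize(e) for e in video_embs]
    c = [l2_normalize(e) for e in caption_embs]
    n = len(v)

    # Temperature-scaled cosine similarity between every video and caption.
    sim = [
        [sum(a * b for a, b in zip(v[i], c[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Video-to-caption direction: the correct caption sits on the diagonal.
    v2c = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n

    # Caption-to-video direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    c2v = -sum(log_softmax_row(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (v2c + c2v)
```

Calling `symmetric_contrastive_loss` on a batch whose pairs are correctly aligned should return a much smaller value than on the same batch with the captions shuffled, which is the behavior a contrastive objective is meant to reward.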

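EXPERIMENTAL RESULTS

Research Questions

RQ1: Can an omni-modality video caption corpus improve cross-modality understanding beyond vision-text models?
RQ2: Can a unified vision-audio-subtitle-text foundation model generalize across retrieval, captioning, and question answering on diverse benchmarks?
RQ3: How do large-scale omni-modality pretraining and LLM-based caption integration affect downstream performance?
RQ4: How does VAST-27M compare with existing cross-modality corpora in quality and scale?
RQ5: Which ablations reveal the importance of each modality and of the omni-modality objectives?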

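Key Findings

VAST achieves 22 new state-of-the-art results on cross-modality benchmarks.
VAST surpasses prior models on retrieval, captioning, and question answering across vision-text, audio-text, and multimodal video-text tasks.
Omni-modality pretraining on VAST-27M yields significant gains over various open-source corpora in vision-text and audio-text settings, and improves OMV-OMC (omni-modality video to omni-modality caption) alignment.
Using an LLM to generate omni-modality captions from single-modality captions works better than simple caption concatenation.
In model comparisons, VAST performs strongly on datasets such as MSRVTT, YouCook2, VATEX, and VALOR-32K, often improving substantially over SOTA baselines.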
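
The difference between simple concatenation and LLM-based integration, mentioned in the findings above, can be sketched as two ways of preparing the same inputs. This is an illustrative sketch only: the instruction text, the prompt layout, and both function names are hypothetical, and the actual prompt and LLM used by the authors are not given in this review. No LLM is called here; the second function only builds the prompt string that would be sent to one.

```python
# Hypothetical instruction; the wording used by the authors is not given here.
HYPOTHETICAL_INSTRUCTION = (
    "Combine the vision captions, audio captions, and subtitle below "
    "into one fluent caption that describes the whole video clip."
)


def concatenate_captions(vision_captions, audio_captions, subtitle):
    """Baseline: join all text pieces in a fixed order without rewriting."""
    pieces = list(vision_captions) + list(audio_captions)
    if subtitle:
        pieces.append(subtitle)
    return " ".join(p.strip() for p in pieces if p.strip())


def build_integration_prompt(vision_captions, audio_captions, subtitle,
                             instruction=HYPOTHETICAL_INSTRUCTION):
    """Assemble a prompt asking an LLM to merge the inputs into one caption."""
    lines = [instruction, "", "Vision captions:"]
    lines += [f"- {c.strip()}" for c in vision_captions]
    lines.append("Audio captions:")
    lines += [f"- {c.strip()}" for c in audio_captions]
    # Clips without speech have no subtitle; say so explicitly in the prompt.
    lines.append(f"Subtitle: {subtitle.strip() if subtitle else '(none)'}")
    lines.append("Omni-modality caption:")
    return "\n".join(lines)
```

The concatenation baseline leaves redundancy and contradictions between the captions untouched, whereas the prompt gives the LLM a chance to merge them into a single coherent description, which is the contrast the finding refers to.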

This review was generated by AI and checked by a human editor.