QUICK REVIEW

[Paper Review] Valley: Video Assistant with Large Language model Enhanced abilitY

Ruipu Luo, Ziwang Zhao|arXiv (Cornell University)|Jun 12, 2023

Multimodal Machine Learning Applications | Cited by 30

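One-Sentence Summary

Valley is a multimodal foundation model that connects video, image, and language inputs to a large language model backbone through a simple projection layer, enabling video-grounded instruction following and dialogue. It is trained with a two-stage pipeline (pre-training followed by instruction fine-tuning) and a 100k video instruction dataset.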

ABSTRACT

Large Language Models (LLMs), with remarkable conversational capability, have emerged as AI assistants that can handle both visual and textual modalities. However, their effectiveness in joint video and language understanding has not been extensively explored. In the paper, we introduce Valley, a multi-modal foundation model that is designed to enable enhanced video comprehension and instruction-following capabilities. To this end, we construct two datasets, namely Valley-702k and Valley-instruct-73k, to cover a diverse range of video-text alignment and video-based instruction tasks, such as multi-shot captions, long video descriptions, action recognition, causal inference, etc. Then, we adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding. In addition, we implement a two-phase training approach for Valley: the first phase focuses solely on training the projection module to facilitate the LLM's capacity to understand visual input, and the second phase jointly trains the projection module and the LLM to improve their instruction following ability. Extensive experiments demonstrate that Valley has the potential to serve as an effective video assistant, simplifying complex video-understanding scenarios. Our code and data are published anonymously at https://github.com/valley-vl/Valley.

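Motivation and Objectives

Motivate the need for general-purpose, video-grounded multimodal understanding that goes beyond task-specific models.
Propose Valley, a video-image-language foundation model bridged by a projection layer.
Build a high-quality, ChatGPT-assisted instruction dataset for training multi-task video understanding.
Adopt a two-stage training pipeline: pre-train the projection module, then train jointly to align vision and language.
Demonstrate state-of-the-art zero-shot performance for Valley on video question answering and captioning benchmarks.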

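Proposed Method

Use ViT-L/14 (CLIP) as the vision encoder to extract frame features.
Introduce a temporal modeling module, with three proposed structures (v1, v2, v3), to aggregate temporal information.
Bridge vision and language with a simple projection layer before the input reaches the LLM (Stable-Vicuna).
Construct a 100k video instruction dataset with ChatGPT-assisted prompts covering detailed description, conversation, and complex reasoning.
Two-stage training: (1) pre-train the projection module on image-text and video-text pairs; (2) fine-tune the projection module and the LLM end to end on 234k image/video instruction samples.
Evaluate in zero-shot and few-shot settings on multiple video question answering and multimodal benchmarks.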
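
The method steps above can be illustrated with a minimal sketch. It is not Valley's code. The review does not say how v1, v2, and v3 aggregate frames, so plain mean pooling over frames is assumed here as one simple aggregation. The function names, the module names ("projection", "llm"), and the toy dimensions are all hypothetical. The stage table only records which modules the two-stage description lists as trained.

```python
# Illustrative sketch only; names and shapes are assumptions, not Valley's code.

def mean_pool_frames(frame_feats):
    """Average per-frame feature vectors over time into one video-level vector."""
    num_frames = len(frame_feats)
    dim = len(frame_feats[0])
    return [sum(frame[d] for frame in frame_feats) / num_frames for d in range(dim)]


def linear_project(vec, weight, bias):
    """Apply y = W x + b, where weight is a list of rows (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


# Which (hypothetically named) modules receive gradient updates in each stage,
# following the two-stage description above.
TRAINABLE_BY_STAGE = {
    1: {"projection"},          # stage 1: only the projection module
    2: {"projection", "llm"},   # stage 2: projection module and LLM together
}


def is_trainable(module_name, stage):
    """Return True if the named module is listed as trained in the given stage."""
    return module_name in TRAINABLE_BY_STAGE[stage]


if __name__ == "__main__":
    # Two frames with 2-dimensional features, projected into a 3-dimensional space.
    frames = [[1.0, 2.0], [3.0, 4.0]]
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bias = [0.0, 0.0, 1.0]
    video_vec = mean_pool_frames(frames)
    print(linear_project(video_vec, weight, bias))  # prints [2.0, 3.0, 6.0]
```

In a real system the frame features would come from the vision encoder and the projected vectors would be passed to the LLM alongside the text embeddings. The sketch only shows the data flow and the per-stage choice of trainable modules.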

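Experimental Results

Research Questions

RQ1: Can a single multimodal foundation model understand video, images, and language and interact through natural language?
RQ2: Is a simple projection bridge enough to align visual features with an LLM for robust video-grounded instruction following?
RQ3: How does Valley compare with state-of-the-art baselines on zero-shot/few-shot video question answering, captioning, and multimodal reasoning?
RQ4: How do different temporal modeling strategies affect understanding of long versus short videos?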

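Key Findings

Among the reported methods, Valley achieves state-of-the-art zero-shot performance on the MSVD-QA, MSRVTT-QA, and ActivityNet-QA benchmarks.
Valley-v3 performs best on longer videos (MSRVTT-QA and ActivityNet-QA), while Valley-v1 performs best on shorter videos (MSVD-QA).
On the video-based generation benchmark, Valley-v3 leads in correctness, contextual understanding, temporal understanding, and consistency.
On ScienceQA, Valley shows competitive chain-of-thought and few-shot capabilities, and in some settings even outperforms GPT-3.5.
The three proposed temporal modeling variants capture temporal information effectively, with v3 showing an advantage on longer sequences.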

This review was generated by AI and reviewed by human editors.