QUICK REVIEW

[Paper Review] VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

Muhammad Maaz, Hanoona Rasheed|arXiv (Cornell University)|Jun 13, 2024

Human Pose and Action Recognition|Cited by 7

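One-Sentence Summary

VideoGPT+ combines an image encoder and a video encoder with segment-wise sampling and visual adapters to improve video understanding. It achieves strong results on VCGBench, VCGBench-Diverse, MVBench, and zero-shot question answering, and introduces the VCG+ 112K instruction set and the VCGBench-Diverse benchmark.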

ABSTRACT

Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While the current video LMMs utilize advanced Large Language Models (LLMs), they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which can be important in videos with intricate action sequences. On the other hand, video encoders provide temporal context but are often limited by computational constraints that lead to processing only sparse frames at lower resolutions, resulting in reduced contextual and spatial understanding. To this end, we introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling). The model processes videos by dividing them into smaller segments and applies an adaptive pooling strategy on features extracted by both image and video encoders. Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering. Further, we develop 112K video-instruction set using a novel semi-automatic annotation pipeline which further improves the model performance. Additionally, to comprehensively evaluate video LMMs, we present VCGBench-Diverse, covering 18 broad video categories such as lifestyle, sports, science, gaming, and surveillance videos. This benchmark with 4,354 question-answer pairs evaluates the generalization of existing LMMs on dense video captioning, spatial and temporal understanding, and complex reasoning, ensuring comprehensive assessment across diverse video types and dynamics. Code: https://github.com/mbzuai-oryx/VideoGPT-plus.

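Research Motivation and Objectives

Motivation: combine the spatial detail of an image encoder with the temporal context of a video encoder through dual-encoder fusion.
Propose segment-wise sampling to capture fine-grained temporal dynamics.
Introduce visual adapters that project image and video features into the language space and align them with it.
Create high-quality dense video captioning and question-answering data (VCG+ 112K) and a diverse benchmark (VCGBench-Diverse) to improve evaluation.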

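Proposed Method

Use dual encoders: a pretrained image encoder for rich spatial features and a pretrained video encoder for global temporal context.
Apply segment-wise sampling, which splits the video into K segments and processes each segment separately.
Project image and video features into the language space with trainable vision-language adapters, and apply 2x2 adaptive token pooling to reduce the sequence length.
Concatenate the image embeddings and segment-wise video embeddings with the text embeddings and feed them to a frozen large language model that is fine-tuned with LoRA.
Train in two stages: first pretrain the image-only and video-only adapters on CC-595K, then instruction-tune on the combined features (4K context) with LoRA.
Evaluate on VCGBench, VCGBench-Diverse, and MVBench, plus zero-shot QA, using 16 frames for VCGBench and VCGBench-Diverse and 8 frames for MVBench.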
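
The segment-wise sampling and 2x2 token pooling steps listed above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the bin-centre sampling rule, the requirement that the sample count be divisible by K, and the use of plain average pooling are all assumptions made for illustration; the summary only states that the video is split into K segments and that 2x2 adaptive pooling shortens the token sequence.

```python
from typing import List

# A token grid is indexed as [row][column][channel] (hypothetical layout).
Grid = List[List[List[float]]]


def segment_frame_indices(num_frames: int, num_samples: int,
                          num_segments: int) -> List[List[int]]:
    """Pick num_samples frame indices spread evenly over the video,
    then split them into num_segments contiguous groups (assumed scheme)."""
    if num_samples % num_segments != 0:
        raise ValueError("num_samples must be divisible by num_segments")
    if num_frames < num_samples:
        raise ValueError("video has fewer frames than requested samples")
    # Take the centre frame of each of num_samples equal-width bins.
    indices = [int((i + 0.5) * num_frames / num_samples)
               for i in range(num_samples)]
    per_segment = num_samples // num_segments
    return [indices[s * per_segment:(s + 1) * per_segment]
            for s in range(num_segments)]


def pool_tokens_2x2(grid: Grid) -> Grid:
    """Average every non-overlapping 2x2 window of tokens,
    cutting the token count by a factor of four."""
    height, width = len(grid), len(grid[0])
    if height % 2 or width % 2:
        raise ValueError("grid height and width must be even")
    channels = len(grid[0][0])
    pooled: Grid = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            window = [grid[r][c], grid[r][c + 1],
                      grid[r + 1][c], grid[r + 1][c + 1]]
            row.append([sum(tok[k] for tok in window) / 4.0
                        for k in range(channels)])
        pooled.append(row)
    return pooled


# Usage: 16 sampled frames split into 4 segments (K = 4 is an assumed value).
segments = segment_frame_indices(num_frames=160, num_samples=16, num_segments=4)
print(len(segments), len(segments[0]))  # 4 segments of 4 frames each

# A toy 4x4 grid of single-channel tokens shrinks from 16 tokens to 4.
toy_grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
print(pool_tokens_2x2(toy_grid))
```

In this sketch each segment would be passed to the video encoder as a short clip while the same frames go through the image encoder. Pooling is applied to each encoder's token grid before the adapters' outputs are concatenated with the text embeddings.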

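Experimental Results

Research Questions

RQ1: How does a dual encoder (image + video) affect video-conversation performance compared with single-encoder baselines?
RQ2: Does segment-wise sampling preserve temporal dynamics for the LLM better than uniform sampling?
RQ3: How do the vision-language adapters and the aggregation/pooling strategy affect the alignment of visual features with the language model?
RQ4: How well do VideoGPT+ variants generalize to diverse video domains (VCGBench-Diverse) and to zero-shot question-answering settings?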

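Key Findings

VideoGPT+ scores 3.28 on average on VCGBench, surpassing the previous state of the art.
On VCGBench-Diverse, VideoGPT+ averages 2.47, with notable gains in spatial understanding and visual reasoning.
On MVBench, VideoGPT+ averages 58.7% across 20 tasks, with gains on several individual tasks (e.g., Action Prediction, Moving Count, Moving Attributes).
In zero-shot QA, VideoGPT+ outperforms prior methods on MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet-QA (e.g., 72.4 accuracy and a score of 3.9 on MSVD-QA).
Ablations show that the dual encoder outperforms single-encoder setups (dual encoder: 3.28 vs. image-only 3.17 and video-only 3.20).
VCG+ 112K, with its improved annotation pipeline for dense captions and QA data, raises DO (Detail Orientation) and TU (Temporal Understanding).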
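
To put the ablation numbers in perspective, the short sketch below computes the absolute gain of the dual-encoder setup over each single-encoder baseline. The three scores are the ones quoted in this review; the dictionary keys and the helper name are hypothetical.

```python
# Ablation averages as quoted in this review.
scores = {"dual": 3.28, "image_only": 3.17, "video_only": 3.20}


def gain_over(baseline: str, reference: str = "dual") -> float:
    """Absolute score gain of `reference` over `baseline`, rounded to 2 decimals."""
    return round(scores[reference] - scores[baseline], 2)


for name in ("image_only", "video_only"):
    print(f"dual vs {name}: +{gain_over(name):.2f}")
```

By this arithmetic the dual encoder adds 0.11 over the image-only setup and 0.08 over the video-only setup.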


This review was generated by AI and reviewed by human editors.