QUICK REVIEW

[Paper Review] VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

Xingjian He, Sihan Chen|arXiv (Cornell University)|May 22, 2023

Multimodal Machine Learning Applications | Cited by 11

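One-Sentence Summary

VLAB transfers CLIP-based image-text representations to video-language pre-training through feature adapting and feature blending, so that a single model serves both generative and contrastive video-language tasks and achieves leading results on several benchmarks.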

ABSTRACT

Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations. However, there is limited research on learning video-text representations for general video multimodal tasks based on these powerful features. Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks. Specifically, VLAB is founded on two key strategies: feature adapting and feature blending. In the former, we introduce a new video adapter module to address CLIP's deficiency in modeling temporal information and extend the model's capability to encompass both contrastive and generative tasks. In the latter, we propose an end-to-end training method that further enhances the model's performance by exploiting the complementarity of image and video features. We validate the effectiveness and versatility of VLAB through extensive experiments on highly competitive video multimodal tasks, including video text retrieval, video captioning, and video question answering. Remarkably, VLAB outperforms competing methods significantly and sets new records in video question answering on MSRVTT, MSVD, and TGIF datasets. It achieves an accuracy of 49.6, 61.0, and 79.0, respectively. Codes and models will be released.

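Motivation and Objectives

Investigate how image-text models such as CLIP can be used for unified video-language pre-training.
Develop a video adapter that captures temporal dynamics and enables generative tasks.
Propose a feature blending mechanism that fuses image and video features within a single model.
Demonstrate the effectiveness of VLAB on video captioning, video question answering (VQA), and text-video retrieval benchmarks.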

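Proposed Method

Introduce a video adapter into CLIP's visual encoder to model temporal information and enable generative tasks.
Train in two stages: adapted transfer (CLIP frozen except for the adapter) and integrated tuning (all parameters trainable).
Develop two feature blending strategies (stacked and parallel) that fuse image and video features in the multimodal encoder.
Optimize a joint loss L = L_vtc + L_mlm + L_uni-lm to support both contrastive and generative tasks.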
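
The two-stage schedule and the joint loss above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the parameter-naming convention (adapter weights contain the substring "adapter"), the symmetric InfoNCE form of L_vtc, and the temperature of 0.07 are all assumptions that this review does not specify. The L_mlm and L_uni-lm terms are passed in as precomputed numbers.

```python
import math


def trainable_params(param_names, stage):
    """Return the parameter names that would receive gradients in a stage.

    Assumption: adapter parameters can be recognised by the hypothetical
    substring "adapter" in their name.
    """
    if stage == "adapted_transfer":
        # Stage 1: CLIP stays frozen; only the video adapter is updated.
        return [n for n in param_names if "adapter" in n]
    if stage == "integrated_tuning":
        # Stage 2: every parameter is trainable.
        return list(param_names)
    raise ValueError(f"unknown stage: {stage}")


def vtc_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a square video-text similarity matrix.

    sim[i][j] is the similarity of video i and text j; matched pairs sit
    on the diagonal. The symmetric form and the temperature value are
    assumptions borrowed from common CLIP-style training.
    """
    n = len(sim)

    def cross_entropy_diag(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    columns = [list(c) for c in zip(*sim)]  # text-to-video direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(columns))


def joint_loss(l_vtc, l_mlm, l_uni_lm):
    # Unweighted sum, matching L = L_vtc + L_mlm + L_uni-lm.
    return l_vtc + l_mlm + l_uni_lm
```

In a real training loop, the list returned by trainable_params would decide which tensors have gradients enabled before each stage, and the three loss terms would come from the model's contrastive, masked-language-modeling, and unidirectional language-modeling heads.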

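Experimental Results

Research Questions

RQ1: Can CLIP representations be transferred effectively to video-language pre-training to form a unified model across multiple tasks?
RQ2: How can temporal dynamics be integrated into an image-text model without forgetting the knowledge CLIP has already learned?
RQ3: Which blending strategy best combines image and video features for video-language tasks?
RQ4: Do feature adapting and feature blending bring improvements on video captioning, VQA, and retrieval benchmarks?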

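Key Findings

The 1.6B-parameter VLAB reaches accuracies of 49.6 (MSRVTT), 61.0 (MSVD), and 79.0 (TGIF) on video question answering, surpassing prior methods such as GiT2 and Flamingo.
VLAB-L (0.9B) outperforms state-of-the-art methods that use larger models or more data on video question answering benchmarks; VLAB-G sets new records on MSRVTT, MSVD, and TGIF.
The video adapter improves performance and scales with Webvid10M data, with the benefit most visible when training on larger datasets.
Both cross-attention blending strategies (parallel and stacked) fuse image and video features effectively; sharing the cross-attention weights is more memory-efficient and remains effective.
Adapted transfer followed by integrated tuning outperforms single-stage training of the video adapter.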
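
To make the difference between the stacked and parallel blending strategies concrete, here is a toy sketch in plain Python. It rests on several assumptions that this review does not confirm: attention is single-head with identity projections, residual connections are used, the stacked variant attends to image features first and video features second, and the parallel variant sums the two attention outputs. Reusing the same cross_attend function for both feature streams stands in for the shared cross-attention weights mentioned above. All function names are hypothetical.

```python
import math


def cross_attend(queries, context):
    """Single-head scaled dot-product attention with identity projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * k[j] for w, k in zip(weights, context))
                    for j in range(d)])
    return out


def add(a, b):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def stacked_blend(text, image_feats, video_feats):
    # Sequential: attend to image features, then attend to video features
    # using the already updated text states (order is an assumption).
    h = add(text, cross_attend(text, image_feats))
    return add(h, cross_attend(h, video_feats))


def parallel_blend(text, image_feats, video_feats):
    # Side by side: both attentions read the same text states and their
    # outputs are summed with the residual.
    from_image = cross_attend(text, image_feats)
    from_video = cross_attend(text, video_feats)
    return add(text, add(from_image, from_video))
```

Both functions return one vector per text token. The structural difference is that in the stacked variant the second attention sees states already updated by the first, while in the parallel variant the two attentions are independent of each other.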


This review was generated by AI and checked by a human editor.