QUICK REVIEW

[Paper Review] CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

Han Fang, Pengfei Xiong|arXiv (Cornell University)|Jun 21, 2021

Multimodal Machine Learning Applications|References: 38|Cited by: 130

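One-Sentence Summary

CLIP2Video transfers CLIP's image-language pre-training to video-text retrieval using two modules, a Temporal Difference Block and a Temporal Alignment Block, and reports state-of-the-art results on MSR-VTT, MSVD, and VATEX.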

ABSTRACT

We present CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in the domain of video-and-language learning try to distill the spatio-temporal video features and multi-modal interaction between videos and languages from a large-scale video-text dataset. Different from them, we leverage pretrained image-language model, simplify it as a two-stage framework with co-learning of image-text and enhancing temporal relations between video frames and video-text respectively, make it able to train on comparatively small datasets. Specifically, based on the spatial semantics captured by Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motions at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies, and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.

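Research Motivation and Goals

Reformulate video-text retrieval as two separate problems: image-text multimodal learning, and the temporal relations between video frames and text.
Leverage a pretrained image-language model (CLIP) so that end-to-end training is feasible on comparatively small datasets.
Introduce two temporal modules that capture motion and align video clips with contextual words to improve cross-modal retrieval performance.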

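Proposed Method

Initialize the image-text embeddings from CLIP and model the temporal relations across video frames separately.
The Temporal Difference Block (TDB) inserts motion-related tokens between adjacent frame embeddings to strengthen the motion representation.
The Temporal Alignment Block (TAB) learns K shared centers that align frame embeddings and word embeddings in a joint space, and re-weights them according to motion relevance.
The global representation (f^g) and the aligned representation (f^a) are aggregated for the symmetric contrastive loss.
Training uses a symmetric cross-entropy loss over video-text pairs, and the final similarity is computed as the average of the similarities obtained from the global (g) and aligned (a) embeddings.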
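
The steps listed above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names are hypothetical, the difference token is taken to be the raw difference of adjacent frame embeddings (the summary does not say how TDB transforms it), no temperature scaling is applied, and the two loss directions are summed rather than averaged. All of these are assumptions made for illustration.

```python
import math


def interleave_difference_tokens(frames):
    """Insert one difference token between each pair of adjacent frame embeddings.

    Assumption: the raw difference (next - current) stands in for the learned
    motion token produced by the Temporal Difference Block.
    """
    if not frames:
        return []
    out = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])  # motion-like token
        out.append(list(cur))                           # original frame token
    return out


def combine_similarity(sim_g, sim_a, w=0.5):
    """Mix global and aligned similarity matrices; w = 0.5 is the plain average."""
    return [
        [w * g + (1.0 - w) * a for g, a in zip(row_g, row_a)]
        for row_g, row_a in zip(sim_g, sim_a)
    ]


def _log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_cross_entropy(sim):
    """Symmetric cross-entropy over a square similarity matrix.

    sim[i][j] is the similarity of video i and text j; matched pairs lie on
    the diagonal. The video-to-text and text-to-video terms are summed.
    """
    n = len(sim)
    v2t = -sum(_log_softmax(sim[i])[i] for i in range(n)) / n
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n
    return v2t + t2v
```

With T input frames, `interleave_difference_tokens` returns 2T - 1 tokens, and `symmetric_cross_entropy` decreases as the diagonal of the similarity matrix grows relative to the off-diagonal entries.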

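Experimental Results

Research Questions

RQ1: How can image-language pre-training be transferred effectively to video-text retrieval?
RQ2: Can temporal information be modeled explicitly to improve video-text alignment without large-scale video-language pre-training?
RQ3: Do the temporal difference and alignment modules yield measurable gains over standard baselines?
RQ4: How does the number of alignment centers affect retrieval performance?
RQ5: How should the global and aligned representations be combined at inference?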

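Key Findings

The method achieves state-of-the-art results for text-to-video and video-to-text retrieval on MSR-VTT, MSVD, and VATEX.
The Temporal Difference Block significantly improves performance by injecting motion-related tokens before temporal processing.
The Temporal Alignment Block, through its shared centers, improves cross-modal alignment between video frames and contextual words, yielding further gains.
A balanced combination of the global and aligned representations (w = 0.5) gives the best retrieval performance.
On MSR-VTT (1k-A protocol), the method reaches an R@1 of 45.6 for Text→Video and 43.5 for Video→Text (values from Table 3).
On MSR-VTT (1k-A protocol), the method reaches a median rank (MdR) of 2.0 for Text→Video and 2.0 for Video→Text (values from Table 3).
On VATEX, the method achieves strong retrieval performance that surpasses several baselines (Tables 4-5).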
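
For readers unfamiliar with the R@1 and MdR figures quoted above, the following sketch shows one common way to compute recall at K and median rank from a query-by-candidate similarity matrix. The function names are hypothetical. It assumes exactly one correct candidate per query, placed at the same index as the query, and it resolves ties in favor of the correct candidate. Benchmarks with several captions per video need a different ground-truth mapping.

```python
from statistics import median


def retrieval_ranks(sim):
    """Return the 1-based rank of the correct candidate for each query.

    sim[i][j] is the similarity of query i to candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scoring strictly higher than the match.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct candidate is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median rank (MdR) of the correct candidate; lower is better."""
    return float(median(ranks))


if __name__ == "__main__":
    sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.1],
        [0.2, 0.4, 0.7],
    ]
    ranks = retrieval_ranks(sim)
    print(ranks, recall_at_k(ranks, 1), median_rank(ranks))
```

Text→Video and Video→Text scores come from the same procedure applied to the similarity matrix and to its transpose, respectively.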


This review was generated by AI and checked by human editors.