QUICK REVIEW

[Paper Review] Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

Xing Cheng, Hezheng Lin|arXiv (Cornell University)|Sep 9, 2021

Multimodal Machine Learning Applications|References: 38|Cited by: 66

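One-Sentence Summary

Proposes CAMoE, a multi-stream corpus alignment network with Mixture-of-Experts, together with a Dual Softmax Loss that addresses content heterogeneity in video-text retrieval, reaching SOTA on MSR-VTT, MSVD, and LSMDC.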

ABSTRACT

Employing large-scale pre-trained model CLIP to conduct video-text retrieval task (VTR) has become a new trend, which exceeds previous VTR methods. Though, due to the heterogeneity of structures and contents between video and text, previous CLIP-based models are prone to overfitting in the training phase, resulting in relatively poor retrieval performance. In this paper, we propose a multi-stream Corpus Alignment network with single gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to solve the two heterogeneity. The CAMoE employs Mixture-of-Experts (MoE) to extract multi-perspective video representations, including action, entity, scene, etc., then align them with the corresponding part of the text. In this stage, we conduct massive explorations towards the feature extraction module and feature alignment module. DSL is proposed to avoid the one-way optimum-match which occurs in previous contrastive methods. Introducing the intrinsic prior of each pair in a batch, DSL serves as a reviser to correct the similarity matrix and achieves the dual optimal match. DSL is easy to implement with only one-line code but improves significantly. The results show that the proposed CAMoE and DSL are of strong efficiency, and each of them is capable of achieving State-of-The-Art (SOTA) individually on various benchmarks such as MSR-VTT, MSVD, and LSMDC. Further, with both of them, the performance is advanced to a big extend, surpassing the previous SOTA methods for around 4.6% R@1 in MSR-VTT.

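Motivation and Objectives

Address the heterogeneity between video and text in VTR by decomposing visual and semantic information into multiple streams.
Introduce CAMoE (a multi-stream Mixture-of-Experts) to learn diverse cross-modal representations.
Propose the Dual Softmax Loss to enforce a dual optimal match and reduce the one-way optimum-match problem in contrastive learning.
Show that CAMoE and DSL, individually and jointly, advance SOTA on standard benchmarks.
Run ablation experiments to understand the design choices and offer guidance for future cross-modal pre-trained models.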

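Proposed Method

CAMoE uses several experts (fusion, entity, action) together with a gate that fuses multi-perspective video representations, each aligned to the corresponding aspect of the text.
Sentence generation strategies (RKW, AKWE, MUW) turn the text into semantically focused inputs; MUW is the one chosen in the experiments.
Three visual frame aggregation schemes (mean pooling, SE-attention, self-attention) are combined with different expert/gate configurations to improve efficiency and performance.
The Dual Softmax Loss revises the standard symmetric cross-entropy by introducing a cross-direction prior Pr, which biases the similarity matrix toward its diagonal (the ground-truth matches).
DSL computes Pr from temperature-scaled similarities and shapes the loss to favor pairs that score highly in both the Text-to-Video and Video-to-Text directions; it can be integrated with a single line of code.
The experiments use CLIP-based features (Bert, ViT) and standard training protocols on MSR-VTT, MSVD, and LSMDC.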
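The revision step that DSL applies to the similarity matrix can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `temp` and `scale` parameters, and the toy similarity matrix are all hypothetical, and the exact normalization used in the paper may differ. The prior is taken here as a softmax along the opposite retrieval direction and is multiplied into the similarity matrix before the usual cross-entropy over the diagonal.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def revise_with_prior(sim, temp=10.0):
    """Multiply each similarity by a prior from the opposite direction.

    Assumption: the prior for entry (i, j) is a softmax over column j of the
    temperature-scaled similarities, so a candidate keeps its score for a
    query only if that query is also a strong match from the candidate's side.
    """
    priors = [softmax([temp * s for s in col]) for col in transpose(sim)]
    return [[sim[i][j] * priors[j][i] for j in range(len(sim[i]))]
            for i in range(len(sim))]


def diagonal_cross_entropy(sim, scale=1.0):
    """Mean of -log softmax(row)[i] over rows, with pair (i, i) as the true match."""
    losses = [-math.log(softmax([scale * s for s in row])[i])
              for i, row in enumerate(sim)]
    return sum(losses) / len(losses)


def symmetric_loss(sim, scale=1.0):
    """Plain symmetric contrastive loss: average of the two directions."""
    return 0.5 * (diagonal_cross_entropy(sim, scale)
                  + diagonal_cross_entropy(transpose(sim), scale))


def dual_softmax_loss(sim, temp=10.0, scale=1.0):
    """Same loss, but each direction is first revised by the other direction's prior."""
    sim_t = transpose(sim)
    return 0.5 * (diagonal_cross_entropy(revise_with_prior(sim, temp), scale)
                  + diagonal_cross_entropy(revise_with_prior(sim_t, temp), scale))


def argmax_rows(sim):
    """Index of the highest-scoring column in each row."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in sim]


# Toy text-by-video similarity matrix (rows: texts, columns: videos).
# Text 1 truly matches video 1 but scores slightly higher on video 0.
sim = [[0.9, 0.2],
       [0.6, 0.5]]

plain_top1 = argmax_rows(sim)
revised_top1 = argmax_rows(revise_with_prior(sim))
```

In this toy batch, plain row-wise ranking sends both texts to video 0, while the revised matrix recovers the diagonal because video 0 prefers text 0 far more strongly than text 1.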

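Experimental Results

Research Questions

RQ1: Does a multi-stream, expert-based architecture align video and text content better than single-stream or dual-stream models?
RQ2: Do the dual optimal match hypothesis and the proposed Dual Softmax Loss improve retrieval accuracy by correcting asymmetric matching between text and video?
RQ3: To what extent do the sentence generation strategy and the choice of visual frame aggregation affect performance?
RQ4: How well does CAMoE generalize when combined with other methods and across datasets?
RQ5: What insights do the ablation studies offer for designing future cross-modal pre-training architectures?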

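Key Findings

CAMoE (without DSL) sets a new SOTA on several benchmarks and improves robustness by decomposing the task across specialized experts.
With DSL, CAMoE improves further; on MSR-VTT in particular, R@1 rises by about 4.6% in absolute terms over the previous SOTA.
DSL improves Video-to-Text more than Text-to-Video, which addresses the content heterogeneity that arises when text descriptions are not specific.
On MSR-VTT, MSVD, and LSMDC, CAMoE and DSL, individually and jointly, bring significant gains in R@1, R@5, and R@10 while lowering the mean rank.
The ablation studies show that multi-task input with different captions and selective gating improve over single-task or fully gated configurations.
Applying DSL to other CLIP-based methods consistently improves performance, indicating that it is broadly applicable.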
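For readers unfamiliar with the metrics quoted above, the following sketch shows how R@K and mean rank are commonly computed from a query-by-candidate similarity matrix. It assumes that query i's ground-truth candidate sits at index i; the function names and the toy matrix are illustrative only and are not taken from the paper.

```python
def ranks_of_matches(sim):
    """1-based rank of the ground-truth candidate (index i) for each query row i."""
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scoring strictly higher than the true match.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose true match appears in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def mean_rank(ranks):
    """Average rank of the true match; lower is better."""
    return sum(ranks) / len(ranks)


# Toy 3x3 similarity matrix (rows: queries, columns: candidates).
sim = [[0.9, 0.1, 0.2],
       [0.7, 0.6, 0.1],
       [0.3, 0.8, 0.5]]

ranks = ranks_of_matches(sim)
r_at_1 = recall_at_k(ranks, 1)
r_at_2 = recall_at_k(ranks, 2)
avg_rank = mean_rank(ranks)
```

Higher R@K and lower mean rank both indicate better retrieval, which is why the findings report gains in the former alongside a drop in the latter.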

This review was generated by AI and reviewed by a human editor.