QUICK REVIEW

[Paper Review] Support-set bottlenecks for video-text representation learning

Mandela Patrick, Po-Yao Huang|arXiv (Cornell University)|Oct 6, 2020

Multimodal Machine Learning Applications | References: 96 | Cited by: 37

One-sentence summary

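The paper proposes cross-instance captioning with a support-set bottleneck as a complement to contrastive video-text learning, improving semantic sharing between samples and retrieval performance on several datasets.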

ABSTRACT

The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs. We posit that this last behaviour is too strict, enforcing dissimilar representations even for samples that are semantically-related -- for example, visually similar videos or ones that share the same depicted action. In this paper, we propose a novel method that alleviates this by leveraging a generative model to naturally push these related samples together: each sample's caption must be reconstructed as a weighted combination of other support samples' visual representations. This simple idea ensures that representations are not overly-specialized to individual samples, are reusable across the dataset, and results in representations that explicitly encode semantics shared between samples, unlike noise contrastive learning. Our proposed method outperforms others by a large margin on MSR-VTT, VATEX and ActivityNet, and MSVD for video-to-text and text-to-video retrieval.

Motivation and objectives

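Improve video-text representations by moving beyond strict instance discrimination, which pushes apart even semantically related samples.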

Proposed method

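Combine cross-modal contrastive learning with a generative cross-captioning objective
Introduce a cross-instance attention mechanism that reconstructs each caption from a weighted mixture of the other videos in the batch
Define batch-level attention that selects a support set and forms the representation from which the text is reconstructed
Apply a hinge-based triplet contrastive loss to video-text pairs, plus a cross-captioning loss with a tunable weight lambda
Experiment with variants of the cross-captioning attention (Identity, Full, Hybrid, Cross) and study the effect of support-set size
Train with Adam, keeping the video encoder frozen while the other modules are fine-tuned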
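
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: attention weights are assumed to come from a softmax over caption-video dot products, the Cross variant is assumed to exclude a sample's own video from its support set, the margin value is arbitrary, and the caption decoder is left out (its loss is simply passed in as a number). All function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def support_set_mixture(text_emb, video_embs, index, exclude_self=True):
    """Mix the batch videos into one representation for caption `index`.

    Assumption: weights are a softmax over caption-video dot products.
    With exclude_self=True the sample's own video is left out of the
    support set, so the caption must be rebuilt from other videos.
    Returns (per-video weights, mixed video representation).
    """
    n = len(video_embs)
    support = [j for j in range(n) if not (exclude_self and j == index)]
    weights = softmax([dot(text_emb, video_embs[j]) for j in support])
    dim = len(video_embs[0])
    mixed = [
        sum(w * video_embs[j][d] for w, j in zip(weights, support))
        for d in range(dim)
    ]
    full = [0.0] * n
    for w, j in zip(weights, support):
        full[j] = w
    return full, mixed


def hinge_triplet_loss(text_embs, video_embs, margin=0.2):
    """Bidirectional hinge loss over all in-batch negatives (needs n >= 2)."""
    n = len(text_embs)
    sim = [[dot(t, v) for v in video_embs] for t in text_embs]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # caption i scored against a wrong video j
            loss += max(0.0, margin + sim[i][j] - sim[i][i])
            # video i scored against a wrong caption j
            loss += max(0.0, margin + sim[j][i] - sim[i][i])
    return loss / (2 * n * (n - 1))


def total_loss(contrastive, captioning, lam=1.0):
    """Contrastive loss plus lambda-weighted cross-captioning loss."""
    return contrastive + lam * captioning
```

Under the same assumptions, calling support_set_mixture with exclude_self=False would let a caption attend to its own video as well, which is closer in spirit to the Full variant named in the list.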

Experimental results

Research questions

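RQ1: Does a generative cross-captioning objective improve multimodal representations learned with a contrastive loss?
RQ2: Does reconstructing captions from a batch-based support set encourage semantic sharing between samples?
RQ3: Which cross-captioning variant gives the best retrieval performance across datasets?
RQ4: How does support-set size affect retrieval performance?
RQ5: How does pretraining on HowTo100M affect the final results?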

Key findings

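The Cross variant of cross-captioning achieves the best text-to-video retrieval on MSR-VTT (R@1 27.2%, R@5 55.2%) along with the best related metrics
In ablations, combining temporal attention, stronger text encoding/decoding, and the triplet-based contrastive loss outperforms the baseline
Pretraining on HowTo100M further improves performance on MSR-VTT, VATEX, ActivityNet, and MSVD
The cross-captioning objective acts as a bottleneck that encourages concept sharing and improves semantic retrieval
Support sets that are too small or too large both reduce performance, indicating an optimal intermediate size
Qualitative attention analysis shows that the model attends to semantically related samples rather than memorizing isolated video-caption pairs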
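
For readers unfamiliar with the R@1 and R@5 figures quoted above, the sketch below shows one common way to compute Recall@K for text-to-video retrieval. It is an illustration only: it assumes a square similarity matrix in which caption i belongs to video i, and the function name is hypothetical; the paper's exact evaluation protocol may differ.

```python
def recall_at_k(sim, k):
    """Percentage of text queries whose correct video ranks in the top k.

    sim[i][j] is the similarity between caption i and video j, and the
    correct video for caption i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank = number of videos scored strictly higher
        # than the correct one.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy similarity matrix: 3 captions against 3 videos.
toy_sim = [
    [0.9, 0.1, 0.0],  # correct video ranked first
    [0.8, 0.2, 0.1],  # correct video ranked second
    [0.3, 0.5, 0.4],  # correct video ranked second
]
r1 = recall_at_k(toy_sim, 1)
r2 = recall_at_k(toy_sim, 2)
```

Video-to-text retrieval can be scored the same way by transposing the matrix so that rows index videos.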

This review was generated by AI and checked by a human editor.