QUICK REVIEW

[Paper Review] MomentDiff: Generative Video Moment Retrieval from Random to Real

Pandeng Li, Chen-Wei Xie|arXiv (Cornell University)|Jul 6, 2023

Multimodal Machine Learning Applications | Cited by 23

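One-Sentence Summary

MomentDiff introduces a diffusion-based generative framework for video moment retrieval that starts from random spans and iteratively refines them into accurate temporal boundaries under the guidance of text-video similarity, reducing reliance on dataset location biases.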

ABSTRACT

Video moment retrieval pursues an efficient and generalized solution to identify the specific temporal segments within an untrimmed video that correspond to a given language description. To achieve this goal, we provide a generative diffusion-based framework called MomentDiff, which simulates a typical human retrieval process from random browsing to gradual localization. Specifically, we first diffuse the real span to random noise, and learn to denoise the random noise to the original span with the guidance of similarity between text and video. This allows the model to learn a mapping from arbitrary random locations to real moments, enabling the ability to locate segments from random initialization. Once trained, MomentDiff could sample random temporal segments as initial guesses and iteratively refine them to generate an accurate temporal boundary. Different from discriminative works (e.g., based on learnable proposals or queries), MomentDiff with random initialized spans could resist the temporal location biases from datasets. To evaluate the influence of the temporal location biases, we propose two anti-bias datasets with location distribution shifts, named Charades-STA-Len and Charades-STA-Mom. The experimental results demonstrate that our efficient framework consistently outperforms state-of-the-art methods on three public benchmarks, and exhibits better generalization and robustness on the proposed anti-bias datasets. The code, model, and anti-bias evaluation datasets are available at https://github.com/IMCCretrieval/MomentDiff.
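The two processes described in the abstract, corrupting a real span into noise and refining a random span back toward a moment, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration rather than the authors' implementation: the cosine noise schedule, the DDIM-style deterministic update, the function names, and the convention that a span is a (center, width) pair already rescaled to roughly [-1, 1]. The `denoiser` argument stands in for a trained network that predicts the clean span from a noisy one.

```python
import math
import random

def alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level at step t under a cosine schedule (assumed)."""
    f = lambda u: math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def diffuse_span(x0, t, num_steps, rng):
    """Forward process: corrupt a clean span x0 with Gaussian noise at step t."""
    a = alpha_bar(t, num_steps)
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]

def refine_span(denoiser, num_steps, sample_steps, rng):
    """Reverse process: start from pure noise and refine with DDIM-style updates."""
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]  # random initial span
    times = [round(num_steps * (1 - i / sample_steps)) for i in range(sample_steps + 1)]
    for t, t_prev in zip(times[:-1], times[1:]):
        a_t, a_prev = alpha_bar(t, num_steps), alpha_bar(t_prev, num_steps)
        x0_pred = denoiser(x, t)  # model's guess of the clean span
        # Noise implied by the current sample and the predicted clean span.
        eps = [(xi - math.sqrt(a_t) * x0i) / math.sqrt(1.0 - a_t)
               for xi, x0i in zip(x, x0_pred)]
        # Move to the less noisy step t_prev along the same noise direction.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * e
             for x0i, e in zip(x0_pred, eps)]
    return x

rng = random.Random(0)
target = [-0.4, -0.6]            # hypothetical ground-truth span
oracle = lambda x, t: target     # stand-in for a trained denoiser
print(refine_span(oracle, num_steps=1000, sample_steps=4, rng=rng))
```

With the oracle denoiser the loop lands on the target span regardless of the random start, which is the behavior a trained model is meant to approximate from text-video features.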

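Motivation and Objectives

Offer a generative perspective on video moment retrieval (VMR) to mitigate temporal location biases in datasets.
Propose MomentDiff, a diffusion-based framework that maps random spans to real moments under the guidance of text-video similarity.
Demonstrate efficiency, generalization, and robustness on several public datasets and on anti-bias benchmarks.
Provide anti-bias evaluation datasets for studying how shifts in location distribution affect VMR models.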

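Proposed Method

Represents video and text with multimodal embeddings through similarity-aware conditioning.
Uses a similarity loss to encode coarse-grained text-video relations into the fused embeddings that guide generation.
Introduces the Video Moment Denoiser (VMD), which turns random spans into real moments through the reverse diffusion process.
Models forward diffusion by corrupting real spans with Gaussian noise, and learns a denoising network that predicts span coordinates and confidence scores.
Implements span embeddings with a simple linear projection (FC) rather than ROI-based features, avoiding the proposal bottleneck.
Trains with a combined loss of L1, IoU, and cross-entropy terms, plus the similarity loss that aligns the multimodal features.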
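The combined span loss in the last item can be sketched for a single prediction as follows. This is an illustrative sketch under stated assumptions, not the paper's exact formulation: the IoU term is written as a plain `1 - IoU` (a generalized IoU variant is equally plausible), the loss weights `w_l1`, `w_iou`, and `w_ce` are hypothetical, spans are normalized (center, width) pairs, and the similarity loss is left out because it depends on the feature encoder.

```python
import math

def to_start_end(span):
    """Convert a (center, width) span to (start, end)."""
    center, width = span
    return center - width / 2.0, center + width / 2.0

def temporal_iou(a, b):
    """Intersection over union of two (start, end) spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def span_loss(pred_span, gt_span, pred_conf, is_match,
              w_l1=1.0, w_iou=1.0, w_ce=1.0):
    """Weighted sum of L1, IoU, and binary cross-entropy terms (weights assumed)."""
    l1 = sum(abs(p - g) for p, g in zip(pred_span, gt_span))
    iou = temporal_iou(to_start_end(pred_span), to_start_end(gt_span))
    p = min(max(pred_conf, 1e-7), 1.0 - 1e-7)  # keep log() finite
    ce = -math.log(p) if is_match else -math.log(1.0 - p)
    return w_l1 * l1 + w_iou * (1.0 - iou) + w_ce * ce

# A confident, well-localized prediction scores lower than a disjoint one.
print(span_loss((0.50, 0.20), (0.50, 0.20), pred_conf=0.99, is_match=True))
print(span_loss((0.10, 0.10), (0.80, 0.20), pred_conf=0.50, is_match=True))
```

The L1 term keeps gradients informative when spans do not overlap, while the IoU term rewards overlap directly, which is one common reason the two are paired.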

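Experimental Results

Research Questions

RQ1: Can a diffusion-based generator learn to produce accurate video moment spans from random initialization, without relying on predefined proposals?
RQ2: Does MomentDiff generalize better than discriminative VMR methods and resist temporal location biases more robustly?
RQ3: How do the anti-bias datasets with location distribution shifts (Charades-STA-Len and Charades-STA-Mom) affect VMR performance and model robustness?
RQ4: How do design choices (span embedding, diffusion scale, VMD, loss design) affect VMR performance?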

Key Findings

Method	Features	R1@0.5	R1@0.7	MAP@0.5	MAP@0.75	MAP_avg
MomentDiff	SF+C, Glove	55.57	32.42	61.07	32.51	32.85

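MomentDiff consistently outperforms state-of-the-art methods on Charades-STA, QVHighlights, and TACoS, and performs well across multiple feature settings.
Compared with MomentDETR, MomentDiff shows stronger robustness and generalization on the anti-bias datasets Charades-STA-Len and Charades-STA-Mom.
With SF+C features and CLIP-based text conditioning, MomentDiff reaches top performance on Charades-STA (e.g., R1@0.5 of 55.57 and MAP_avg of 32.85).
Ablation studies show that the FC span embedding outperforms ROI-based embeddings, that combining VMD with noise-conditioned diffusion improves results, and that a balanced set of losses is beneficial.
Transfer experiments show that MomentDiff maintains strong performance in cross-domain settings (Charades-CD and ActivityNet-CD), beyond its training distribution.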
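For readers unfamiliar with the metric names above, R1@m is conventionally the percentage of queries whose top-ranked predicted span has a temporal IoU of at least m with the ground-truth moment. The sketch below shows that computation; the function names and the toy spans are hypothetical and are not data from the paper.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, iou_thresh):
    """Percentage of queries whose top-1 span reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_thresh for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)

# Hypothetical toy predictions and ground truths, one pair per query.
preds = [(2.0, 8.0), (0.0, 5.0), (10.0, 12.0)]
gts = [(3.0, 10.0), (4.0, 10.0), (10.0, 12.5)]
print(recall_at_1(preds, gts, 0.5), recall_at_1(preds, gts, 0.7))
```

Raising the threshold from 0.5 to 0.7 can only keep or lower the score, which is why R1@0.7 in the table is below R1@0.5.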


This review was generated by AI and reviewed by a human editor.