QUICK REVIEW

[Paper Review] Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Teng Wang, Jinrui Zhang|arXiv (Cornell University)|Mar 11, 2023

Multimodal Machine Learning Applications | Cited by 7

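One-Sentence Summary

The paper proposes a grounded vision-language framework for untrimmed videos that pairs text-to-event grounding with event-to-text generation and adds semantic-aware label assignment, reaching state-of-the-art dense video captioning results and competitive results on other vision-language understanding and generation tasks.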

ABSTRACT

Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which makes human-annotated event boundaries necessary during inference. To break away from the ties, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively excavates the alignments between multi-sentence descriptions and corresponding event segments. Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments, i.e., text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground the possible event proposals given a set of sentences by estimating the cross-modal distance in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representation to retain meaningful semantic information. To encourage accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost to mitigate the sub-optimal matching results caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2 and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge. Our code is publicly available at https://github.com/zjr2000/GVL.

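Research Motivation and Goals

Learn discriminative, temporally sensitive event representations from untrimmed videos under natural-language supervision.
Design two dual pretext tasks, text-to-event grounding (TEG) and event-to-text generation (ETG), to promote fine-grained video-language alignment.
Handle noisy boundary annotations with a semantic-aware label assignment strategy so that event-sentence matching stays robust.
Demonstrate extensibility to visually-grounded language understanding and generation tasks across multiple datasets.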

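Proposed Method

Encode the untrimmed video into a set of event proposals, extracted by a Transformer-based event detector with learnable queries.
Align events with sentences through text-to-event grounding (TEG): compute cross-modal similarity in a joint vision-language space and apply a contrastive loss.
Generate sentences from event proposals with the event-to-text generation (ETG) module, which also predicts temporal boundaries and confidence scores.
Apply semantic-aware label assignment, which combines a semantic cost based on cross-modal similarity with a localization cost to obtain a robust one-to-one matching.
Train with a combined objective that jointly optimizes the ETG and TEG losses, guided by the semantic-aware matching.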
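
The TEG alignment step in the list above can be sketched as an InfoNCE-style contrastive loss over sentence-to-event similarities. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `teg_contrastive_loss`, the choice of cosine similarity, the temperature value, and the toy embeddings are all hypothetical, because this summary does not give the paper's exact loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def teg_contrastive_loss(text_embs, event_embs, matched, temperature=0.1):
    """Mean text-to-event contrastive loss (assumed InfoNCE form).

    matched[j] is the index of the event proposal paired with sentence j;
    every other proposal acts as a negative for that sentence.
    """
    total = 0.0
    for j, text in enumerate(text_embs):
        logits = [cosine(text, event) / temperature for event in event_embs]
        # Numerically stable log-sum-exp over all event proposals.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[matched[j]]
    return total / len(text_embs)


# Toy embeddings in a shared 2-D space (hypothetical values).
text_embs = [[1.0, 0.0], [0.0, 1.0]]
event_embs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]

aligned_loss = teg_contrastive_loss(text_embs, event_embs, matched=[0, 1])
swapped_loss = teg_contrastive_loss(text_embs, event_embs, matched=[1, 0])
print(aligned_loss, swapped_loss)
```

With these toy vectors, pairing each sentence with its nearest event proposal gives a lower loss than the swapped pairing, which is the behavior a contrastive objective is meant to reward.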

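Experimental Results

Research Questions

RQ1: Does bidirectional supervision (TEG and ETG) yield more discriminative and semantically richer event representations from untrimmed videos?
RQ2: Does semantic-aware label assignment improve robustness to noisy boundaries compared with conventional IoU-based matching?
RQ3: How does the proposed framework perform on dense video captioning across several untrimmed-video datasets and on other vision-language understanding and generation tasks?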
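
To make the comparison in RQ2 concrete, the sketch below matches event proposals to annotated sentences twice: once with a localization-only cost and once with an added semantic cost. It is an illustration under assumptions, not the paper's formulation: the cost terms (1 minus temporal IoU, 1 minus cosine similarity), the weight `sem_weight`, the function name `assign_labels`, and all toy values are hypothetical. Brute-force enumeration stands in for a proper assignment solver such as the Hungarian algorithm and is only practical for tiny sets.

```python
import itertools
import math


def temporal_iou(a, b):
    """IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def assign_labels(proposals, event_embs, gt_segments, text_embs, sem_weight=1.0):
    """One-to-one matching; result[j] is the proposal index for sentence j.

    Assumes there are at least as many proposals as annotated sentences.
    """
    n, m = len(proposals), len(gt_segments)
    cost = [
        [
            (1.0 - temporal_iou(proposals[i], gt_segments[j]))
            + sem_weight * (1.0 - cosine(event_embs[i], text_embs[j]))
            for j in range(m)
        ]
        for i in range(n)
    ]
    # Enumerate every injective assignment and keep the cheapest one.
    return min(
        itertools.permutations(range(n), m),
        key=lambda perm: sum(cost[perm[j]][j] for j in range(m)),
    )


# Proposals 0 and 1 overlap the first annotated segment equally (ambiguous boundary).
proposals = [(0.0, 10.0), (2.0, 12.0), (30.0, 40.0)]
event_embs = [[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]]
gt_segments = [(1.0, 11.0), (29.0, 41.0)]
text_embs = [[1.0, 0.0], [0.0, 1.0]]

iou_only_match = assign_labels(proposals, event_embs, gt_segments, text_embs, sem_weight=0.0)
semantic_match = assign_labels(proposals, event_embs, gt_segments, text_embs, sem_weight=1.0)
print(iou_only_match, semantic_match)
```

In this toy case the first two proposals have the same temporal IoU with the first annotated segment, so a localization-only cost cannot separate them. The semantic term selects proposal 1, whose embedding agrees with the sentence.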

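Key Findings

Achieves state-of-the-art dense video captioning results on ActivityNet Captions, YouCook2, and YouMakeup.
Shows competitive results on other language generation and understanding tasks, including video paragraph captioning and single-sentence and multi-sentence video grounding.
Outperforms the baselines and is robust to boundary noise, thanks to semantic-aware matching.
The dual pretext tasks improve both caption quality and temporal action localization.
Took first place in the MTVG and MDVC tasks of the PIC 4th Challenge, underscoring its practical effectiveness.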

This review was generated by AI and checked by a human editor.