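[Paper Explained] Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video
This paper proposes a two-stage method for weakly-supervised temporal grounding. It first selects a coarse video segment using multi-scale sliding windows and multiple-instance learning (MIL), then refines that segment to frame-level boundaries through fine-grained frame-sentence interaction and watershed-based grouping.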
In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video. Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence, with no reliance on any temporal annotation during training. We propose a two-stage model to tackle this problem in a coarse-to-fine manner. In the coarse stage, we first generate a set of fixed-length temporal proposals using multi-scale sliding windows, and match their visual features against the sentence features to identify the best-matched proposal as a coarse grounding result. In the fine stage, we perform a fine-grained matching between the visual features of the frames in the best-matched proposal and the sentence features to locate the precise frame boundary of the fine grounding result. Comprehensive experiments on the ActivityNet Captions dataset and the Charades-STA dataset demonstrate that our two-stage model achieves compelling performance.
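Motivation and Goals
- Reduce reliance on costly temporal annotations of sentences in videos.
- Localize the video segment that semantically matches the query sentence without using any temporal annotation during training.
- Develop a coarse-to-fine framework that yields precise start and end timestamps.
- Use MIL to learn from video-sentence pairs and sliding-window proposals.
- Evaluate on ActivityNet Captions and Charades-STA to demonstrate effectiveness.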
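Proposed Method
- Encode the sentence with a bidirectional LSTM (Bi-LSTM) over GloVe embeddings.
- Encode the video from frame features, using a Bi-LSTM for context modeling.
- Generate fixed-length temporal proposals with multi-scale sliding windows (80% overlap).
- Coarse stage: a two-stream localizer (classification + selection) computes fused multimodal scores and is trained with MIL.
- Fine stage: expand the coarse segment, perform frame-level interaction with the sentence, and predict per-frame scores; then apply watershed-based grouping to obtain precise boundaries.
- Train in two stages: first the coarse stage with an MIL loss, then the fine stage with a ranking loss that separates correct video-sentence pairs from incorrect ones.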
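The pipeline steps listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the window sizes, the sigmoid/softmax fusion of the two streams, and the single-threshold grouping (a simplification of multi-level watershed grouping) are all assumptions chosen for illustration, and the logits and frame scores are made-up inputs standing in for network outputs.

```python
import math

def sliding_window_proposals(num_frames, window_sizes, overlap=0.8):
    """Return (start, end) frame spans, end exclusive, for every window size."""
    proposals = []
    for w in window_sizes:
        if w > num_frames:
            continue  # this scale is longer than the video
        # 80% overlap means each window advances by 20% of its length.
        stride = max(1, round(w * (1.0 - overlap)))
        for start in range(0, num_frames - w + 1, stride):
            proposals.append((start, start + w))
    return proposals

def two_stream_scores(cls_logits, sel_logits):
    """Fuse a classification stream and a selection stream (assumed fusion).

    cls: sigmoid per proposal (does it match the sentence?).
    sel: softmax across proposals (which proposal matches best?).
    The summed product is a video-level score usable for MIL training.
    """
    cls = [1.0 / (1.0 + math.exp(-x)) for x in cls_logits]
    m = max(sel_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in sel_logits]
    total = sum(exps)
    sel = [e / total for e in exps]
    fused = [c * s for c, s in zip(cls, sel)]
    return fused, sum(fused)

def watershed_group(frame_scores, threshold):
    """Group consecutive frames scoring >= threshold.

    Returns the (start, end) run, end exclusive, with the largest summed
    score, or None if no frame reaches the threshold.
    """
    best, best_sum, start = None, float("-inf"), None
    n = len(frame_scores)
    for i in range(n + 1):
        inside = i < n and frame_scores[i] >= threshold
        if inside and start is None:
            start = i  # a new run begins
        elif not inside and start is not None:
            run_sum = sum(frame_scores[start:i])
            if run_sum > best_sum:
                best, best_sum = (start, i), run_sum
            start = None  # the run has ended
    return best

# Hypothetical inputs: a 100-frame video and two window scales.
proposals = sliding_window_proposals(100, [10, 20])
fused, video_score = two_stream_scores([2.0, -1.0, 0.5], [0.1, 0.3, 2.0])
best = max(range(len(fused)), key=fused.__getitem__)  # coarse result index
segment = watershed_group([0.1, 0.6, 0.7, 0.2, 0.9, 0.8, 0.95, 0.1], 0.5)
```

Because only video-level labels are available, the MIL loss in such a setup would be applied to `video_score` (matched versus mismatched video-sentence pair), while the per-proposal `fused` scores are what get ranked at inference time.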
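Experimental Results
Research Questions
- RQ1: Can weak supervision reach competitive temporal grounding performance without any temporal annotation?
- RQ2: Does the coarse-to-fine strategy yield better boundary precision than a single-stage approach?
- RQ3: Compared with proposal-level (coarse) inference, how does fine-grained frame-level interaction affect grounding accuracy?
Main Findings
| Method | R@1 IoU=0.1 | R@1 IoU=0.3 | R@1 IoU=0.5 | mIoU |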
|---|---|---|---|---|
| ActivityNet Captions - CTRL (fully-supervised) | 49.1 | 28.7 | 14.0 | 20.5 |
| ActivityNet Captions - Yuan et al. (fully-supervised) | 73.3 | 55.7 | 36.8 | 37.0 |
| ActivityNet Captions - Xu et al. (fully-supervised) | - | 45.3 | 27.7 | - |
| ActivityNet Captions - He et al. (fully-supervised) | - | - | 36.9 | - |
| ActivityNet Captions - Mithun et al. (weakly-supervised) | 62.7 | 42.0 | 23.3 | 28.2 |
| ActivityNet Captions - Gao et al. (GRU, weakly-supervised) | 74.0 | 42.3 | 22.5 | 31.8 |
| ActivityNet Captions - Gao et al. (BERT, weakly-supervised) | 75.4 | 42.8 | 22.7 | 32.2 |
| ActivityNet Captions - Ours (weakly-supervised) | 74.2 | 44.3 | 23.6 | 32.2 |
| Charades-STA - CTRL (fully-supervised) | - | 23.6 | 8.9 | - |
| Charades-STA - Xu et al. (fully-supervised) | 54.7 | 35.6 | 15.8 | - |
| Charades-STA - He et al. (fully-supervised) | - | 36.7 | - | - |
| Charades-STA - Mithun et al. (weakly-supervised) | 32.1 | 19.9 | 8.8 | - |
| Charades-STA - Ours (weakly-supervised) | 39.8 | 27.3 | 12.9 | 27.3 |
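- Under weak supervision, the proposed two-stage model achieves competitive results on both ActivityNet Captions (R@1 IoU=0.5 of 23.6, mIoU of 32.2) and Charades-STA (R@1 IoU=0.5 of 12.9, mIoU of 27.3).
- The coarse stage alone clearly outperforms random proposal selection and provides a solid basis for grounding.
- In the fine stage, frame-level interaction combined with watershed-based grouping gives more precise temporal boundaries than the coarse result.
- The full coarse-to-fine model outperforms the baselines and several weakly-supervised methods; on key metrics it surpasses the fully-supervised CTRL on both datasets and approaches some other fully-supervised methods.
- Ablation studies show that the two-stream localizer benefits the coarse stage, while a fully-connected fine localizer performs better for fine-grained grounding.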
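For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how R@1 at an IoU threshold and mIoU are commonly computed for temporal grounding. The function names and the example segments are hypothetical, and whether the threshold comparison is `>=` or `>` is an assumption, since evaluation scripts differ on this detail.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, iou_threshold):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)

def mean_iou(preds, gts):
    """Average top-1 IoU over all queries, as a percentage."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(ious) / len(ious)

# Made-up predictions and ground-truth spans (seconds) for three queries.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (40.0, 50.0)]
r_at_03 = recall_at_1(preds, gts, 0.3)
r_at_05 = recall_at_1(preds, gts, 0.5)
miou = mean_iou(preds, gts)
```

Here the three IoU values are 1, 1/3, and 0, so two of three predictions pass the 0.3 threshold and only one passes 0.5, which is why recall drops as the threshold tightens.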
This summary was generated by AI and reviewed by a human editor.