QUICK REVIEW

[Paper Review] ExCL: Extractive Clip Localization Using Natural Language Descriptions

Soham Ghosh, Anuva Agarwal|arXiv (Cornell University)|Apr 4, 2019

Video Analysis and Summarization|Cited by 78

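One-Sentence Summary

ExCL learns cross-modal interactions to predict the exact start and end frames of a video clip given a natural language query, outperforming prior ranking-based methods on TACoS and ActivityNet and matching them on Charades-STA. It models three span predictor variants and evaluates both classification and regression training objectives.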

ABSTRACT

The task of retrieving clips within videos based on a given natural language query requires cross-modal reasoning over multiple frames. Prior approaches such as sliding window classifiers are inefficient, while text-clip similarity driven ranking-based approaches such as segment proposal networks are far more complicated. In order to select the most relevant video clip corresponding to the given text description, we propose a novel extractive approach that predicts the start and end frames by leveraging cross-modal interactions between the text and video - this removes the need to retrieve and re-rank multiple proposal segments. Using recurrent networks we encode the two modalities into a joint representation which is then used in different variants of start-end frame predictor networks. Through extensive experimentation and ablative analysis, we demonstrate that our simple and elegant approach significantly outperforms state of the art on two datasets and has comparable performance on a third.

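Motivation and Objectives

Advance extraction-based clip localization, as distinct from ranking-based methods that rely on a fixed set of candidate clips.
Propose a modular cross-modal framework that predicts start and end frames directly from text-video interactions.
Evaluate different span predictor architectures and training objectives across datasets.
Show that an extractive model with temporal context performs strongly and generalizes across datasets.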

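Proposed Method

Encode the text by passing GloVe embeddings through a bidirectional LSTM to obtain a sentence embedding.
Encode the video by running a bidirectional LSTM over I3D features to capture temporal context.
Compute per-frame start and end scores with one of three span predictor variants (MLP, Tied-LSTM, Conditioned-LSTM).
Train with either a classification loss (on softmax-normalized start/end probabilities) or a regression loss (on the expectation of the softmax distribution).
For regression, model P(end|start) by applying a softmax over masked logits to enforce end >= start, and use the expected start and end times as the prediction.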
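
The two training objectives and the masked softmax can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the use of raw per-frame logits as input, and in particular the way the conditional P(end|start) is averaged over the start distribution to obtain an expected end time are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of per-frame scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_loss(start_logits, end_logits, start_idx, end_idx):
    """Negative log-likelihood of the ground-truth start and end frames."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    return -(math.log(p_start[start_idx]) + math.log(p_end[end_idx]))

def masked_softmax(logits, first_valid):
    """Softmax over frames >= first_valid; earlier frames get probability 0."""
    return [0.0] * first_valid + softmax(logits[first_valid:])

def expected_span(start_logits, end_logits, times):
    """Expected start and end times (hypothetical helper).

    `times` holds the timestamp of each frame in nondecreasing order.
    Assumption: the expected end is E[end | start = s] averaged over P(start).
    """
    p_start = softmax(start_logits)
    exp_start = sum(p * t for p, t in zip(p_start, times))
    exp_end = 0.0
    for s, ps in enumerate(p_start):
        # Masking frames before s enforces end >= start for this start frame.
        p_end_given_s = masked_softmax(end_logits, s)
        exp_end += ps * sum(p * t for p, t in zip(p_end_given_s, times))
    return exp_start, exp_end
```

Because every conditional distribution places zero mass before its start frame, the expected end time in this sketch can never fall before the expected start time. A regression loss would then compare these expected times with the annotated start and end times.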

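Experimental Results

Research Questions

RQ1: Can an extractive end-to-end model localize the exact video clip described by a natural language query without ranking multiple proposals?
RQ2: How do different cross-modal span predictor architectures affect localization accuracy across datasets?
RQ3: Does a regression objective outperform a classification objective for precise temporal localization?
RQ4: How does the model perform on datasets that differ in video length and vocabulary size?
RQ5: How much does including the video LSTM encoder affect performance?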

Key Findings

Dataset	IoU=0.3	IoU=0.5	IoU=0.7
TACoS	22.6	12.6	5.1
TACoS	42.0	25.0	12.3
TACoS	41.9	25.5	13.6
TACoS	41.7	26.0	12.9
TACoS	44.2	28.0	14.6
TACoS	44.4	27.8	14.6
TACoS	26.2	11.9	4.8
TACoS	45.2	27.5	12.9
TACoS	41.4	24.8	11.4
TACoS	42.2	27.2	11.7
TACoS	45.5	28.0	13.8
TACoS	42.3	27.3	12.5
Charades-STA	55.4	30.4	12.1
Charades-STA	64.7	43.8	23.0
Charades-STA	64.2	43.9	23.4
Charades-STA	64.6	41.5	23.1
Charades-STA	65.1	44.1	23.4
Charades-STA	61.4	41.8	22.4
ActivityNet	42.5	23.8	12.1
ActivityNet	60.7	40.9	23.4
ActivityNet	60.7	40.9	23.4
ActivityNet	60.4	40.5	23.1
ActivityNet	61.1	41.3	23.4
ActivityNet	62.1	41.6	23.9
ActivityNet	48.4	27.0	11.0
ActivityNet	63.0	43.6	23.6
ActivityNet	61.5	42.7	23.4
ActivityNet	61.5	41.9	23.3
ActivityNet	62.3	42.7	24.1
ActivityNet	61.4	41.7	22.4
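
The IoU columns can be read with the help of a short sketch. It assumes that each value is the percentage of queries whose single predicted span overlaps the annotated span with a temporal IoU at or above the column's threshold; the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time spans."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, threshold):
    """Percentage of queries whose predicted span reaches the IoU threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) >= threshold)
    return 100.0 * hits / len(gts)

# Toy example with three queries (made-up spans, in seconds).
preds = [(0.0, 10.0), (0.0, 10.0), (20.0, 30.0)]
gts = [(5.0, 15.0), (0.0, 10.0), (0.0, 10.0)]
scores = {m: recall_at_iou(preds, gts, m) for m in (0.3, 0.5, 0.7)}
```

In the toy example the three predictions have IoUs of 1/3, 1 and 0, so two of three queries count as correct at IoU=0.3 and only one at IoU=0.5 and IoU=0.7. The same pattern explains why the numbers in every row fall as the threshold tightens.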

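The extractive model significantly outperforms prior ranking-based baselines on TACoS and ActivityNet.
Adding the video LSTM improves performance substantially, and span predictors with recurrent encoding (especially the Tied-LSTM) achieve strong results across datasets.
Training with regression yields results comparable to or slightly better than classification, with no evident loss of information.
Without the video LSTM, a recurrent span predictor is essential for capturing cross-modal interactions.
The Tied-LSTM span predictor generally gives the best or near-best results across datasets and settings.
TACoS remains the most challenging benchmark because it demands high temporal precision.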

This review was generated by AI and reviewed by human editors.