QUICK REVIEW

[Paper Digest] Query-Focused Video Summarization: Dataset, Evaluation, and A Memory Network Based Approach

Aidean Sharghi, Jacob Laurel|arXiv (Cornell University)|Jul 16, 2017

Video Analysis and Summarization | References: 1 | Cited by: 27

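ONE-SENTENCE SUMMARY

This paper proposes a query-focused video summarization framework that combines a memory network with a sequential determinantal point process (seqDPP) to attend the user query onto the video and generate personalized summaries. The work also introduces a new dataset with dense per-shot concept annotations and a semantics-based evaluation metric, and it reports better performance than baseline methods in both automatic and human evaluation.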

ABSTRACT

Recent years have witnessed a resurgence of interest in video summarization. However, one of the main obstacles to the research on video summarization is the user subjectivity - users have various preferences over the summaries. The subjectiveness causes at least two problems. First, no single video summarizer fits all users unless it interacts with and adapts to the individual users. Second, it is very challenging to evaluate the performance of a video summarizer. To tackle the first problem, we explore the recently proposed query-focused video summarization which introduces user preferences in the form of text queries about the video into the summarization process. We propose a memory network parameterized sequential determinantal point process in order to attend the user query onto different video frames and shots. To address the second challenge, we contend that a good evaluation metric for video summarization should focus on the semantic information that humans can perceive rather than the visual features or temporal overlaps. To this end, we collect dense per-video-shot concept annotations, compile a new dataset, and suggest an efficient evaluation method defined upon the concept annotations. We conduct extensive experiments contrasting our video summarizer to existing ones and present detailed analyses about the dataset and the new evaluation method.

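MOTIVATION AND OBJECTIVES

Address user subjectivity in video summarization by personalizing summaries according to text queries supplied by the user.
Overcome the difficulty of evaluating video summarizers by focusing on semantic content rather than visual features or temporal overlap.
Build a new dataset with dense per-shot concept annotations so that evaluation is more accurate and better aligned with human judgment.
Design a neural architecture that integrates query information with video content to produce summaries that are both diverse and relevant.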

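PROPOSED METHOD

Proposes a sequential determinantal point process (seqDPP) parameterized by a memory network, which attends the user query onto video frames and shots and selects the relevant shots.
Represents per-shot concepts as binary semantic vectors and measures semantic similarity between shots with an intersection-over-union (IoU) score.
Feeds the query embedding into the memory network to guide attention over video frames and shots.
Uses a DPP kernel with learned parameters to model diversity among the selected shots and discourage redundancy.
Trains the model end to end with a differentiable objective that jointly accounts for relevance to the query and diversity of the selected shots.
Defines a new evaluation metric based on the IoU similarity between the semantic vectors of the user-annotated summary and the system-generated summary.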
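To make the diversity role of the DPP kernel concrete, here is a minimal sketch of a plain (non-sequential) DPP. It is an illustration only, not the paper's model: the feature vectors are made up, the relevance weights are fixed by hand rather than produced by a memory network, and the names `dpp_kernel` and `subset_score` are hypothetical. It shows that the unnormalized DPP score det(L_Y) is small when the selected shots are near-duplicates and large when they are dissimilar.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def dpp_kernel(features, relevance):
    """Build L[i][j] = q_i * <f_i, f_j> * q_j (quality times similarity)."""
    n = len(features)
    return [
        [
            relevance[i]
            * sum(x * y for x, y in zip(features[i], features[j]))
            * relevance[j]
            for j in range(n)
        ]
        for i in range(n)
    ]


def subset_score(kernel, subset):
    """Unnormalized DPP probability of a subset: det of the sub-kernel."""
    sub = [[kernel[i][j] for j in subset] for i in subset]
    return det(sub)


# Hypothetical shot features: shots 0 and 1 are near-duplicates, shot 2 differs.
features = [normalize([1.0, 0.0]), normalize([0.9, 0.1]), normalize([0.0, 1.0])]
relevance = [1.0, 1.0, 1.0]  # assumed equal query relevance for all shots
kernel = dpp_kernel(features, relevance)

redundant = subset_score(kernel, [0, 1])  # near-duplicate pair, close to 0
diverse = subset_score(kernel, [0, 2])    # orthogonal pair, equal to 1
print(redundant, diverse)
```

In this toy setting, raising a shot's relevance weight scales the score of every subset containing it, which is one way query relevance and diversity can be traded off inside a single kernel.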

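EXPERIMENTAL RESULTS

Research Questions

RQ1: Can query-focused video summarization produce personalized summaries that match user preferences?
RQ2: Compared with existing metrics such as ROUGE-SU4, how well does the proposed semantic evaluation metric correlate with human judgment?
RQ3: To what extent does the proposed combination of a memory network and a DPP improve summary quality in the query-focused setting?
RQ4: How much does each component of the proposed model (e.g., the attention mechanism, the embedding dimensionality, the DPP) contribute to performance?
RQ5: Does the new dataset with dense per-shot concept annotations enable more reliable and fine-grained evaluation of video summarizers?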

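Key Findings

In the generic video summarization setting, the proposed model outperforms existing methods even when the baselines (e.g., SubMod and Quasi) are given the optimal summary length.
Ablation studies confirm that the attention mechanism, the embedding layer, and the DPP work together: removing any one of them causes a significant drop in performance.
When shots are randomly removed from the user summaries, recall under the proposed metric decreases linearly, which is consistent and predictable behavior, whereas ROUGE-SU4 responds nonlinearly.
The semantic metric based on concept annotations correlates better with human perception than caption-based metrics such as ROUGE-SU4, capturing fine visual details more stably and more completely.
The dataset with dense per-shot concept annotations enables finer-grained and more reliable evaluation, for example by capturing semantic differences that captions do not reflect.
The model's attention mechanism aligns well with user queries, selecting semantically relevant shots across a variety of query types.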
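The linear recall behavior noted above can be reproduced on toy data. The sketch below is a hedged reconstruction, not the authors' evaluation code: it assumes shots are compared by the IoU of their concept sets, that system and user shots are paired by a maximum-weight bipartite matching, and that precision and recall are the matched weight divided by the system and user summary lengths. The digest itself only states that the metric is IoU-based, so the matching step, the brute-force search, the concept names, and the function names are all illustrative assumptions.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two concept sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matching_weight(system, user):
    """Maximum-weight bipartite matching by brute force (tiny inputs only)."""
    small, large = (system, user) if len(system) <= len(user) else (user, system)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        weight = sum(iou(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, weight)
    return best


def precision_recall_f1(system, user):
    """Precision, recall, and F1 derived from the matched IoU weight."""
    if not system or not user:
        return 0.0, 0.0, 0.0
    weight = best_matching_weight(system, user)
    precision = weight / len(system)
    recall = weight / len(user)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical user summary: one concept set per selected shot.
reference = [
    {"car", "street"},
    {"food", "hands"},
    {"computer", "desk"},
    {"tree", "sky"},
]

# Drop the last k shots and score the degraded summary against the original.
recalls = []
for k in range(len(reference)):
    degraded = reference[: len(reference) - k]
    recalls.append(precision_recall_f1(degraded, reference)[1])

print(recalls)  # [1.0, 0.75, 0.5, 0.25]
```

Because each retained shot matches itself with IoU 1 and the concept sets here are disjoint, recall falls by exactly one quarter per removed shot. Real summaries with overlapping concepts would deviate from this idealized line, and a practical implementation would replace the brute-force search with a polynomial-time assignment solver.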


This digest was generated by AI and reviewed by a human editor.