QUICK REVIEW

[Paper Review] Unsupervised Extraction of Video Highlights Via Robust Recurrent Auto-encoders

Huan Yang, Baoyuan Wang|arXiv (Cornell University)|Oct 6, 2015

Video Analysis and Summarization | References: 31 | Cited by: 26

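One-Sentence Summary

The paper proposes an unsupervised method for extracting video highlights, built on a robust recurrent auto-encoder (RRAE) trained on user-edited videos crawled from the web. It exploits the sub-events that recur across edited clips, uses a shrinking exponential loss to stay robust to noise, and models temporal structure with bidirectional LSTM cells. Without needing the raw counterparts of the edited videos, it achieves performance approaching that of supervised methods and generalizes well across diverse video domains.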

ABSTRACT

With the growing popularity of short-form video sharing platforms such as Instagram and Vine, there has been an increasing need for techniques that automatically extract highlights from video. Whereas prior works have approached this problem with heuristic rules or supervised learning, we present an unsupervised learning approach that takes advantage of the abundance of user-edited videos on social media websites such as YouTube. Based on the idea that the most significant sub-events within a video class are commonly present among edited videos while less interesting ones appear less frequently, we identify the significant sub-events via a robust recurrent auto-encoder trained on a collection of user-edited videos queried for each particular class of interest. The auto-encoder is trained using a proposed shrinking exponential loss function that makes it robust to noise in the web-crawled training data, and is configured with bidirectional long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) to better model the temporal structure of highlight segments. Different from supervised techniques, our method can infer highlights using only a set of downloaded edited videos, without also needing their pre-edited counterparts, which are rarely available online. Extensive experiments indicate the promise of our proposed solution in this challenging unsupervised setting.

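Motivation and Objectives

Address the lack of paired raw and edited videos for highlight extraction; such pairs are extremely scarce.
Use the abundance of user-edited short videos on social media as a source of unsupervised training data.
Model highlight sub-events as patterns that recur across edited videos, and treat uncommon or idiosyncratic segments as outliers.
Develop a robust learning framework that remains effective when the web-crawled training data is noisy.
Show that, without ground-truth raw/edited pairs, an unsupervised method can still approach the performance of supervised methods.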

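Proposed Method

The method uses a recurrent auto-encoder (RAE) with bidirectional LSTM cells to model temporal dependencies within highlight segments.
A novel shrinking exponential loss function reduces the influence of noisy or outlier samples during training.
The auto-encoder is trained to reconstruct its input video segments accurately; the lower the reconstruction error, the more likely a segment is a highlight.
Features are extracted with a C3D network and then reduced in dimensionality with a domain-specific principal component analysis (PCA) that retains 90% of the energy.
At inference time, segments with low reconstruction error are labeled as highlights, on the assumption that common sub-events (the highlights) cluster together in feature space.
The model is trained only on downloaded edited videos and never needs access to the raw, unedited source videos.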
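
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the auto-encoder is replaced by an arbitrary `reconstruct` callable, and the loss form `error ** lam` with a linearly decreasing exponent is only an assumed reading of "shrinking exponential loss", since this review does not state the exact formula or schedule.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def components_for_energy(eigenvalues: Sequence[float], energy: float = 0.90) -> int:
    """Smallest number of leading PCA components whose share of the
    total variance ("energy") reaches the requested fraction."""
    ordered = sorted(eigenvalues, reverse=True)
    target = energy * sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= target:
            return k
    return len(ordered)


def reconstruction_error(x: Vector, x_hat: Vector) -> float:
    """Euclidean distance between a segment feature and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))


def shrinking_exponent(epoch: int, total_epochs: int,
                       start: float = 2.0, end: float = 0.5) -> float:
    """ASSUMED schedule: move the exponent linearly from `start` to `end`."""
    if total_epochs <= 1:
        return end
    return start + (end - start) * epoch / (total_epochs - 1)


def error_gradient(error: float, lam: float) -> float:
    """Derivative of the ASSUMED loss error**lam with respect to error,
    i.e. how strongly one sample pulls on training (requires error > 0)."""
    return lam * error ** (lam - 1.0)


def rank_highlights(segments: Sequence[Vector],
                    reconstruct: Callable[[Vector], Vector],
                    top_k: int) -> List[int]:
    """Indices of the top_k segments with the lowest reconstruction error."""
    scored = [(reconstruction_error(s, reconstruct(s)), i)
              for i, s in enumerate(segments)]
    scored.sort()
    return [i for _, i in scored[:top_k]]
```

Under the assumed loss, an exponent of 2 behaves like squared error, so a badly reconstructed sample (a likely outlier) dominates the gradient. Once the exponent drops below 1, `error_gradient` becomes smaller for large errors than for small ones, which is one way a loss can downweight noisy web-crawled clips. In `rank_highlights`, any trained model can stand in for `reconstruct`.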

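Experimental Results

Research Questions

RQ1: Can video highlights be identified effectively in an unsupervised manner, using only edited videos obtained from the web?
RQ2: How can an auto-encoder be made robust to the noise in web-crawled video data for highlight detection?
RQ3: How much does modeling temporal context with bidirectional LSTM improve highlight detection performance?
RQ4: How does the proposed unsupervised method compare with supervised baselines when raw/edited video pairs are unavailable?
RQ5: Do sub-events that recur across user-edited videos reliably indicate salient highlight moments?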

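Key Findings

The proposed robust recurrent auto-encoder (RRAE) reaches an mAP of 0.434 on the YouTube dataset and outperforms a standard auto-encoder, PCA, and OCSVM in every domain.
Adding bidirectional LSTM raises mAP from 0.371 to 0.410, a relative gain of just over 10%, which underscores the value of temporal modeling.
The shrinking exponential loss markedly improves robustness to noisy data by reducing the influence of outliers during training.
Even without access to raw/edited video pairs, the unsupervised RRAE comes reasonably close to the supervised method of Sun et al.: the reported mAP pairs are 0.60 vs. 0.49 on "dog" and 0.61 vs. 0.49 on "surfing", a difference of 0.11 to 0.12.
The method generalizes well across varied domains such as "gymnastics", "parkour", "skating", and "skiing", with consistent performance.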
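
For readers who want to relate these numbers to a ranking, the following sketch shows one common way to compute average precision (AP) and mAP, along with the relative gain quoted above. It assumes the standard definition of AP (the mean of precision at each rank where a true highlight appears). This review does not spell out the paper's exact evaluation protocol, and all function names here are hypothetical.

```python
from typing import List, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP of a ranking: segments are sorted by descending score, and the
    precision is averaged over the ranks at which a positive label occurs."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(videos: Sequence[Tuple[Sequence[float], Sequence[int]]]) -> float:
    """Mean of per-video AP values; each item is (scores, labels)."""
    aps = [average_precision(scores, labels) for scores, labels in videos]
    return sum(aps) / len(aps)


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`."""
    return (after - before) / before
```

Because a lower reconstruction error indicates a more likely highlight, the negated error can serve as the score passed to `average_precision`. Calling `relative_gain(0.371, 0.410)` gives roughly 0.105, which matches the "just over 10%" improvement attributed to bidirectional LSTM.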

This review was generated by AI and reviewed by a human editor.