QUICK REVIEW

[Paper Review] Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM

Lai Jiang, Mai Xu|arXiv (Cornell University)|Sep 19, 2017

Visual Attention and Saliency Detection | References: 66 | Cited by: 72

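One-Sentence Summary

This paper introduces a deep learning framework (OM-CNN and 2C-LSTM) that predicts pixel-wise video saliency by jointly modeling objectness, motion, and inter-frame saliency transition. The framework is trained on a new dataset, LEDOV.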

ABSTRACT

Over the past few years, deep neural networks (DNNs) have exhibited great success in predicting the saliency of images. However, there are few works that apply DNNs to predict the saliency of generic videos. In this paper, we propose a novel DNN-based video saliency prediction method. Specifically, we establish a large-scale eye-tracking database of videos (LEDOV), which provides sufficient data to train the DNN models for predicting video saliency. Through the statistical analysis of our LEDOV database, we find that human attention is normally attracted by objects, particularly moving objects or the moving parts of objects. Accordingly, we propose an object-to-motion convolutional neural network (OM-CNN) to learn spatio-temporal features for predicting the intra-frame saliency via exploring the information of both objectness and object motion. We further find from our database that there exists a temporal correlation of human attention with a smooth saliency transition across video frames. Therefore, we develop a two-layer convolutional long short-term memory (2C-LSTM) network in our DNN-based method, using the extracted features of OM-CNN as the input. Consequently, the inter-frame saliency maps of videos can be generated, which consider the transition of attention across video frames. Finally, the experimental results show that our method advances the state-of-the-art in video saliency prediction.

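Motivation and Objectives

Enable accurate deep-learning-based video saliency prediction by providing sufficient training data.
Analyze the roles of objects and motion in attracting human attention in videos.
Develop an architecture that models both intra-frame saliency and inter-frame saliency transition.
Provide a large-scale eye-tracking video database (LEDOV) to support training and evaluation.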

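Proposed Method

Propose OM-CNN with two subnets, objectness and motion, in which objectness guides the extraction of motion features.
Mask the motion features with a coarse objectness map to focus on object regions.
Concatenate spatial features from the objectness subnet with temporal features from the motion subnet to form spatio-temporal features for saliency prediction.
Develop a two-layer convolutional LSTM (2C-LSTM) with Bayesian dropout to predict pixel-wise saliency transition across frames.
Generate per-frame saliency maps from the 2C-LSTM output using two transposed convolution layers.
Train end-to-end on LEDOV to learn dynamic saliency without assuming a fixed saliency distribution.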
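
The masking and concatenation steps above can be illustrated with a minimal sketch. It assumes that masking is a plain element-wise multiplication of each motion channel by a coarse objectness map with values in [0, 1]; this summary does not give the paper's exact masking formula. Nested lists stand in for feature tensors, and the names `mask_motion_features` and `concat_channels` are hypothetical, chosen only for illustration.

```python
from typing import List

# One channel of a feature map: an H x W grid of activations.
FeatureMap = List[List[float]]


def mask_motion_features(
    motion: List[FeatureMap], objectness: FeatureMap
) -> List[FeatureMap]:
    """Scale every motion channel element-wise by a coarse objectness map.

    Assumption: objectness values lie in [0, 1], so activations outside
    object regions are suppressed and those inside are kept.
    """
    return [
        [
            [m * o for m, o in zip(m_row, o_row)]
            for m_row, o_row in zip(channel, objectness)
        ]
        for channel in motion
    ]


def concat_channels(
    spatial: List[FeatureMap], temporal: List[FeatureMap]
) -> List[FeatureMap]:
    """Stack spatial and temporal channels along the channel axis."""
    return spatial + temporal


# Toy 2x2 example with one spatial channel and one motion channel.
objectness = [[1.0, 0.0], [0.5, 0.0]]
motion = [[[2.0, 4.0], [6.0, 8.0]]]
spatial = [[[0.1, 0.2], [0.3, 0.4]]]

masked = mask_motion_features(motion, objectness)
features = concat_channels(spatial, masked)
```

In this toy case the motion activations in the right-hand column are zeroed because the objectness map is 0 there, and `features` holds two channels that a temporal model such as 2C-LSTM could take as input.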

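Experimental Results

Research Questions

RQ1: Does integrating objectness and motion in a unified OM-CNN improve intra-frame saliency prediction?
RQ2: Can a convolutional LSTM architecture with Bayesian dropout capture temporal saliency transition across video frames?
RQ3: Compared with previous methods, how much do object regions and motion cues contribute to predicting video saliency?
RQ4: How does the large-scale LEDOV eye-tracking database support the training and evaluation of video saliency models?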

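Key Findings

The proposed OM-CNN effectively integrates objectness and motion to predict intra-frame saliency.
Temporal modeling with 2C-LSTM captures inter-frame saliency transition.
Bayesian dropout is used in 2C-LSTM to handle uncertainty in saliency prediction.
LEDOV provides a large-scale, diverse eye-tracking video dataset for training and analysis.
According to the authors' experiments, the method advances the state of the art in video saliency prediction.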

This review was generated by AI and reviewed by a human editor.