QUICK REVIEW

[Paper Review] Revisiting Video Saliency: A Large-scale Benchmark and a New Model

Wenguan Wang, Jianbing Shen|arXiv (Cornell University)|Jan 23, 2018

Visual Attention and Saliency Detection|References: 63|Cited by: 42

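One-Sentence Summary

Introduces DHF1K, a large-scale dynamic saliency dataset of 1K videos and roughly 580K frames (582,605), and proposes an attentive CNN-LSTM model that uses supervised static attention to improve dynamic video saliency prediction.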

ABSTRACT

In this work, we contribute to video saliency research in two ways. First, we introduce a new benchmark for predicting human eye movements during dynamic scene free-viewing, which is long-time urged in this field. Our dataset, named DHF1K (Dynamic Human Fixation), consists of 1K high-quality, elaborately selected video sequences spanning a large range of scenes, motions, object types and background complexity. Existing video saliency datasets lack variety and generality of common dynamic scenes and fall short in covering challenging situations in unconstrained environments. In contrast, DHF1K makes a significant leap in terms of scalability, diversity and difficulty, and is expected to boost video saliency modeling. Second, we propose a novel video saliency model that augments the CNN-LSTM network architecture with an attention mechanism to enable fast, end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, thus allowing LSTM to focus on learning more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. We thoroughly examine the performance of our model, with respect to state-of-the-art saliency models, on three large-scale datasets (i.e., DHF1K, Hollywood2, UCF sports). Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that our model outperforms other competitors.

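Motivation and Objectives

Create a standardized, large-scale benchmark dataset for dynamic (video) saliency that covers diverse scenes, motions, and fixation annotations.
Propose a CNN-LSTM-based video saliency model with a supervised attention mechanism that exploits static fixation data.
Analyze and compare state-of-the-art video saliency models across several benchmarks to establish baselines and offer insights for future work.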

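Proposed Method

Introduce DHF1K, comprising 1,000 videos (582,605 frames) with per-frame fixations from 17 observers, plus category and attribute annotations for deeper fixation analysis.
Develop an attentive CNN-LSTM architecture in which a CNN extracts static intra-frame features, an attention module injects supervised static saliency into the feature maps, and a convLSTM models temporal saliency dynamics.
Apply a 1x1 convolution to the convLSTM's temporal output to produce the dynamic saliency prediction, with an attention-guided residual connection preserving rich spatial information.
Introduce a loss function that combines KL divergence, the linear correlation coefficient (CC), and a Normalized Scanpath Saliency (NSS) term to jointly optimize static and dynamic saliency prediction.
Train with a hybrid protocol: image-based attention supervision on static data and video-based supervision on dynamic data, using a 600/100/300 train/val/test split on DHF1K and comparable splits on Hollywood-2 and UCF Sports.
Report performance with standard saliency metrics (AUC-Judd, SIM, s-AUC, CC, NSS) on three benchmarks (DHF1K, Hollywood-2, UCF Sports).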
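The combined loss described above can be illustrated with a small sketch. It uses plain Python on nested lists so that it stays self-contained; a real training loop would use a tensor library. The weights `w_cc` and `w_nss`, the sign convention (CC and NSS are subtracted because higher values are better), and all function names are assumptions made for illustration, not details confirmed by this summary.

```python
import math

EPS = 1e-8  # small constant to avoid division by zero and log(0)


def _flatten(saliency_map):
    """Turn an H x W nested list into a flat list of floats."""
    return [v for row in saliency_map for v in row]


def kl_divergence(pred, gt_density):
    """KL(gt || pred) after normalizing both maps to sum to 1."""
    p, q = _flatten(pred), _flatten(gt_density)
    sum_p, sum_q = sum(p) + EPS, sum(q) + EPS
    total = 0.0
    for pv, qv in zip(p, q):
        pv, qv = pv / sum_p, qv / sum_q
        total += qv * math.log(qv / (pv + EPS) + EPS)
    return total


def correlation(pred, gt_density):
    """Pearson linear correlation coefficient (CC) between two maps."""
    p, q = _flatten(pred), _flatten(gt_density)
    mean_p, mean_q = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    return cov / (math.sqrt(var_p * var_q) + EPS)


def nss(pred, gt_fixations):
    """Mean z-scored prediction value at fixated pixels (NSS)."""
    p, f = _flatten(pred), _flatten(gt_fixations)
    mean_p = sum(p) / len(p)
    std_p = math.sqrt(sum((a - mean_p) ** 2 for a in p) / len(p))
    hits = [(a - mean_p) / (std_p + EPS) for a, fix in zip(p, f) if fix > 0]
    return sum(hits) / len(hits) if hits else 0.0


def saliency_loss(pred, gt_density, gt_fixations, w_cc=0.1, w_nss=0.1):
    """Assumed combination: KL minus weighted CC and NSS (lower is better)."""
    return (kl_divergence(pred, gt_density)
            - w_cc * correlation(pred, gt_density)
            - w_nss * nss(pred, gt_fixations))
```

With this convention, a prediction that matches the ground-truth density map yields a KL term near zero and a CC near one, so its loss is lower than that of a mismatched prediction.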

Figure 1: Average annotation maps of three datasets used in benchmarking: (a) Hollywood-2, (b) UCF sports, (c) DHF1K.
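An average annotation map of the kind shown in Figure 1 can be sketched as the per-pixel mean of all fixation maps in a dataset. The snippet below is a minimal illustration in plain Python; the function name and the final rescaling by the peak value are assumptions, not the authors' procedure.

```python
def average_annotation_map(fixation_maps):
    """Average a list of H x W fixation maps and rescale so the peak is 1.

    Each map is a nested list where a positive value marks a fixated pixel.
    The peak normalization is an assumed choice made for visualization.
    """
    if not fixation_maps:
        raise ValueError("need at least one fixation map")
    height, width = len(fixation_maps[0]), len(fixation_maps[0][0])
    accum = [[0.0] * width for _ in range(height)]
    for fmap in fixation_maps:
        for y in range(height):
            for x in range(width):
                accum[y][x] += fmap[y][x]
    count = len(fixation_maps)
    mean = [[v / count for v in row] for row in accum]
    peak = max(max(row) for row in mean)
    if peak <= 0:
        return mean  # no fixations anywhere: return the all-zero map
    return [[v / peak for v in row] for row in mean]
```

A map that concentrates near the image center indicates a strong center bias in the dataset's fixations.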

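Experimental Results

Research Questions

RQ1: Can a supervised attention mechanism that exploits static saliency data improve dynamic video saliency prediction?
RQ2: Does a CNN-LSTM framework with an attention module outperform existing dynamic saliency models on large-scale, unconstrained video datasets?
RQ3: How well does the proposed model generalize across diverse datasets (DHF1K, Hollywood-2, UCF Sports) and different training configurations?
RQ4: How does the amount of training data affect dynamic saliency performance?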

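Key Findings

DHF1K is presented as the largest eye-tracking dataset for dynamic free-viewing, with 1,000 videos and 582,605 frames, and is intended to improve generalization and benchmarking.
The attentive CNN-LSTM model consistently outperforms state-of-the-art dynamic saliency models on DHF1K, Hollywood-2, and UCF Sports across multiple metrics.
Incorporating the supervised static attention module improves the spatial feature representation and helps temporal saliency learning without requiring optical flow.
Training on large-scale data improves performance, but data diversity is critical (e.g., UCF Sports benefits from a smaller, less diverse training set).
Inference is fast (about 0.08 s per 224x224 frame), and the method benefits from end-to-end training with no additional pre- or post-processing.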
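The way a static attention map can enhance CNN features while preserving the original signal can be sketched as a residual reweighting. The form used below, output = (1 + attention) * features, is an assumed reading of the attention-guided residual connection mentioned in this review; the function name and the nested-list layout are hypothetical.

```python
def apply_residual_attention(features, attention):
    """Reweight features by (1 + attention), broadcasting over channels.

    features:  list of C channels, each an H x W nested list of floats.
    attention: one H x W nested list, assumed to lie in [0, 1].
    Where attention is 0 the features pass through unchanged, so the
    residual form never suppresses information the CNN extracted.
    """
    return [
        [
            [(1.0 + a) * x for x, a in zip(feat_row, att_row)]
            for feat_row, att_row in zip(channel, attention)
        ]
        for channel in features
    ]
```

Compared with plain multiplication by the attention map, the residual form only amplifies salient locations, which is one plausible reason the spatial information stays rich enough for the temporal module.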

Figure 2: Example frames from DHF1K with fixations (red dots) and corresponding categories.


This review was generated by AI and checked by a human editor.