QUICK REVIEW

[Paper Review] AIM 2024 Challenge on Video Saliency Prediction: Methods and Results

Andrey Moskalenko, Alexey Bryncev | arXiv (Cornell University) | Sep 23, 2024

Visual Attention and Saliency Detection | Cited by 8

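One-sentence summary

This paper reviews the AIM 2024 Video Saliency Prediction Challenge, introduces the AViMoS dataset collected through crowdsourced mouse tracking, and describes the seven submitted solutions in detail. These solutions rely largely on Transformer-based architectures, and some incorporate audio or use dual-branch or multi-branch designs.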

ABSTRACT

This paper reviews the Challenge on Video Saliency Prediction at AIM 2024. The goal of the participants was to develop a method for predicting accurate saliency maps for the provided set of video sequences. Saliency maps are widely exploited in various applications, including video compression, quality assessment, visual perception studies, the advertising industry, etc. For this competition, a previously unused large-scale audio-visual mouse saliency (AViMoS) dataset of 1500 videos with more than 70 observers per video was collected using crowdsourced mouse tracking. The dataset collection methodology has been validated using conventional eye-tracking data and has shown high consistency. Over 30 teams registered in the challenge, and there are 7 teams that submitted the results in the final phase. The final phase solutions were tested and ranked by commonly used quality metrics on a private test subset. The results of this evaluation and the descriptions of the solutions are presented in this report. All data, including the private test subset, is made publicly available on the challenge homepage - https://challenges.videoprocessing.ai/challenges/video-saliency-prediction.html.

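Motivation and objectives

Provide a large-scale audio-visual mouse saliency dataset (AViMoS) for video saliency prediction and validate the quality of its ground truth.
Benchmark multiple methods on a private test subset using standard saliency metrics.
Identify the architectures and modalities (visual, audio) that yield state-of-the-art saliency predictions.
Make the data, code, and results publicly available to support reproducibility and further research.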

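Proposed methods

Use encoder-decoder architectures with Transformer-based backbones (e.g., Video Swin Transformer) to extract spatio-temporal features.
Feed multi-resolution features into the decoder to handle different spatial scales.
Explore dual-branch designs that separate low-resolution context from high-resolution detail through cross-attention mechanisms (e.g., SCAM).
Integrate audio information into audio-visual saliency models where applicable.
Compare models on four metrics (AUC-Judd, CC, SIM, NSS) and report the average rank across them.
Provide a public dataset split (1000 training videos, 500 test videos) along with a private test subset for the final evaluation.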
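Three of the four metrics named above have standard definitions in the saliency literature, and a minimal NumPy sketch of them follows. The function names `cc`, `sim`, and `nss` and the `EPS` guard are illustrative assumptions, not the challenge's evaluation code, and AUC-Judd is left out for brevity.

```python
import numpy as np

EPS = 1e-12  # guard against division by zero (an assumption of this sketch)

def cc(pred, gt):
    """Pearson linear correlation coefficient between two saliency maps."""
    p = (pred - pred.mean()) / (pred.std() + EPS)
    g = (gt - gt.mean()) / (gt.std() + EPS)
    return float((p * g).mean())

def sim(pred, gt):
    """Histogram intersection after normalising each map to sum to 1."""
    p = pred / (pred.sum() + EPS)
    g = gt / (gt.sum() + EPS)
    return float(np.minimum(p, g).sum())

def nss(pred, fixations):
    """Mean of the standardised prediction at fixated pixels (binary mask)."""
    p = (pred - pred.mean()) / (pred.std() + EPS)
    return float(p[fixations.astype(bool)].mean())
```

All three take 2-D arrays of the same shape and assume non-negative maps for `sim`. For `nss`, `fixations` is a binary mask marking fixated pixels. Higher values are better for each metric.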

Figure 1: RPN for video saliency prediction.

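Experimental results

Research questions

RQ1: Can Transformer-based architectures effectively predict video saliency when trained on the large-scale AViMoS dataset?
RQ2: Does incorporating audio information improve saliency prediction on video sequences?
RQ3: How do multi-branch and multi-resolution strategies affect saliency prediction accuracy compared with single-branch approaches?
RQ4: On the AIM 2024 AViMoS benchmark, how does model size (#params) relate to saliency prediction performance?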
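One simple way to probe RQ4 is a rank correlation between parameter count and average leaderboard rank for the seven final-phase teams in the leaderboard table of this review. The sketch below is an illustration under assumptions: the Spearman coefficient is a statistic chosen here, not one taken from the challenge report, and the helpers `ranks` and `spearman` are hypothetical names. With only seven points it can suggest a trend, nothing more.

```python
import numpy as np

# (parameters in millions, average rank) copied from the leaderboard table.
teams = {
    "CV_MM": (420.5, 1.00),
    "VistaHL": (187.7, 2.75),
    "PeRCeiVe Lab": (402.9, 3.75),
    "SJTU-MML": (1288.7, 4.00),
    "MVP": (99.6, 5.00),
    "ZenithChaser": (0.19, 5.50),
    "Exodus": (69.7, 6.00),
}

def ranks(values):
    """Rank positions 1..n in ascending order; assumes no tied values."""
    values = np.asarray(values)
    order = np.argsort(values)
    r = np.empty(len(values))
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman(x, y):
    """Spearman rank correlation, valid when neither input has ties."""
    d = ranks(x) - ranks(y)
    n = len(d)
    return 1.0 - 6.0 * float((d ** 2).sum()) / (n * (n ** 2 - 1))

params, avg_rank = zip(*teams.values())
rho = spearman(params, avg_rank)
print(f"Spearman rho (params vs. average rank): {rho:.3f}")
```

A negative `rho` means larger models tended to place higher, since a lower average rank is better. The relationship is loose: ZenithChaser, at 0.19M parameters, still places above the 69.7M-parameter Exodus entry.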

Key findings

Team	AUC-Judd	CC	SIM	NSS	Avg. rank	#Params (M)
CV_MM	0.894	0.774	0.635	3.464	1.00	420.5
VistaHL	0.892	0.769	0.623	3.352	2.75	187.7
PeRCeiVe Lab	0.857	0.766	0.610	3.422	3.75	402.9
SJTU-MML	0.858	0.760	0.615	3.356	4.00	1288.7
MVP	0.838	0.749	0.587	3.404	5.00	99.6
ZenithChaser	0.869	0.606	0.517	2.482	5.50	0.19
Exodus	0.861	0.599	0.510	2.491	6.00	69.7
Baseline	0.833	0.449	0.424	1.659	8.00	-
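The rank column is consistent with the mean of each team's per-metric rank, matching the averaging described under the proposed methods. The sketch below recomputes it from the table under that assumption. The variable names are illustrative, and the double `argsort` trick assumes no ties within a metric column, which holds for these scores.

```python
import numpy as np

teams = ["CV_MM", "VistaHL", "PeRCeiVe Lab", "SJTU-MML",
         "MVP", "ZenithChaser", "Exodus", "Baseline"]

# Columns: AUC-Judd, CC, SIM, NSS (higher is better for all four).
scores = np.array([
    [0.894, 0.774, 0.635, 3.464],
    [0.892, 0.769, 0.623, 3.352],
    [0.857, 0.766, 0.610, 3.422],
    [0.858, 0.760, 0.615, 3.356],
    [0.838, 0.749, 0.587, 3.404],
    [0.869, 0.606, 0.517, 2.482],
    [0.861, 0.599, 0.510, 2.491],
    [0.833, 0.449, 0.424, 1.659],
])

# Rank each metric column separately: 1 = best score in that column.
# argsort of argsort gives each row's 0-based position in the sorted order.
per_metric_rank = (-scores).argsort(axis=0).argsort(axis=0) + 1

# The final ranking is the mean of the four per-metric ranks.
avg_rank = per_metric_rank.mean(axis=1)

for name, r in zip(teams, avg_rank):
    print(f"{name:14s} {r:.2f}")
```

This makes some of the table's quirks easier to read. VistaHL is second on AUC-Judd, CC, and SIM but fifth on NSS, which gives (2 + 2 + 2 + 5) / 4 = 2.75.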

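The top solutions rely heavily on Transformer-based encoders to extract spatio-temporal features.
The winning team (CV_MM) combined the UMT model with multi-resolution decoder features.
The runner-up (VistaHL) proposed a two-stream approach in which a low-resolution context branch guides a high-resolution detail branch.
Several teams (SJTU-MML, Exodus) used audio information to build audio-visual saliency models.
After filtering and alignment, the AViMoS mouse-tracking data agrees closely with eye-tracking ground truth (AUC-Judd > 0.91, CC > 0.84, SIM > 0.74).
Public and private test results are reported on four metrics (AUC-Judd, CC, SIM, NSS) and show competitive performance across teams.
A center-prior baseline and the organizers' baseline provide reference points for comparison.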
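A center prior is typically a fixed Gaussian centred on the frame and used as the prediction for every video frame. The sketch below shows one such map. The function name `center_prior` and the `sigma_frac` parameter are hypothetical, and this review does not give the exact form or width used by the challenge's center-prior baseline.

```python
import numpy as np

def center_prior(height, width, sigma_frac=0.25):
    """Gaussian saliency map centred on the frame.

    sigma_frac is a hypothetical parameter: the standard deviation
    expressed as a fraction of each frame dimension.
    """
    # Pixel offsets from the frame centre, in units of sigma.
    ys = (np.arange(height) - (height - 1) / 2) / (sigma_frac * height)
    xs = (np.arange(width) - (width - 1) / 2) / (sigma_frac * width)
    prior = np.exp(-0.5 * (ys[:, None] ** 2 + xs[None, :] ** 2))
    return prior / prior.max()  # scale so the peak equals 1

saliency = center_prior(9, 16)
print(saliency.shape)
```

Because the map ignores the video content entirely, any learned model is expected to beat it. That is what makes it a useful lower reference point.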

Figure 2: An overview of the proposed network. SC [25], SE [17], and ShuffleAttn [54] are plug-and-play attention modules. SWF and GA stand for Saliency-Weighted Feature Module and Gated Attention, respectively.


This review was generated by AI and checked by human editors.