QUICK REVIEW

[Paper Digest] DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction

Hamed R. Tavakoli, Ali Borji|arXiv (Cornell University)|May 25, 2019

Visual Attention and Saliency Detection | References: 64 | Cited by: 28

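One-Sentence Summary

This paper proposes DAVE, a simple yet effective deep audio-visual embedding model for dynamic saliency prediction that jointly exploits visual and auditory cues. Trained on the newly built Audio-Visual Eye-tracking corpus (AVE), the model shows that audio substantially improves saliency prediction: it outperforms the visual-only baseline on 53.54% of frames, and its predictions align closely with human gaze at visible sound-source locations.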

ABSTRACT

This paper studies audio-visual deep saliency prediction. It introduces a conceptually simple and effective Deep Audio-Visual Embedding for dynamic saliency prediction dubbed "DAVE" in conjunction with our efforts towards building an Audio-Visual Eye-tracking corpus named "AVE". Despite the strong relation between auditory and visual cues in guiding gaze during perception, video saliency models only consider visual cues and neglect the auditory information that is ubiquitous in dynamic scenes. Here, we investigate the applicability of audio cues in conjunction with visual ones in predicting saliency maps using deep neural networks. To this end, the proposed model is intentionally designed to be simple. Two baseline models are developed on the same architecture, which consists of an encoder-decoder. The encoder projects the input into a feature space, followed by a decoder that infers saliency. We conduct an extensive analysis on different modalities and various aspects of multi-modal dynamic saliency prediction. Our results suggest that (1) audio is a strong contributing cue for saliency prediction, (2) a salient visible sound source is the natural cause of the superiority of our Audio-Visual model, (3) richer feature representations for the input space lead to more powerful predictions even in the absence of more sophisticated saliency decoders, and (4) the Audio-Visual model improves over 53.54% of the frames predicted by the best Visual model (our baseline). Our endeavour demonstrates that audio is an important cue that boosts dynamic video saliency prediction and helps models to approach human performance. The code is available at https://github.com/hrtavakoli/DAVE

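Research Motivation and Objectives

Address the limited integration of audio in existing video saliency models, which rely mainly on visual cues.
Build a large-scale, multi-source audio-visual eye-tracking database (AVE) for training and evaluating deep audio-visual saliency models.
Investigate, through controlled ablation experiments and modality analysis, how much audio contributes as a saliency cue in dynamic video scenes.
Develop a simple, end-to-end trainable deep neural network architecture that supports fair comparison across the visual, audio, and audio-visual modalities.
Assess whether richer input representations (e.g., 3D CNN features) can improve saliency prediction without increasing decoder complexity.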

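Proposed Method

Propose a simple encoder-decoder architecture for saliency prediction in which the visual, audio, and audio-visual models share components to ensure a fair comparison.
Extract rich spatiotemporal features from the video input using 3D convolutional neural networks (3D CNNs) pretrained on large-scale video datasets.
Apply a 1D convolutional neural network to the raw audio waveform to extract temporal audio features that can be processed jointly with the visual features.
Fuse the visual and audio features at an early stage of the network, then predict the saliency map through a shared decoder head.
Train the full model end to end on ground-truth fixation maps derived from human eye-tracking data collected under free-viewing conditions.
Run ablation studies across three video categories (e.g., natural scenes, interviews, sports) to analyze each modality's contribution for different stimulus types.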
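To make the fuse-then-decode idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (their code is in the repository linked in the abstract). The function name `fuse_and_decode`, the weight arguments `w_fuse` and `w_out`, and the choice of channel concatenation followed by 1x1 convolutions and a spatial softmax are all assumptions made for illustration; a real model would learn these weights and operate on encoder outputs rather than hand-written lists.

```python
import math

def fuse_and_decode(visual_feat, audio_feat, w_fuse, w_out):
    """Toy fusion head (illustrative assumption, not the paper's code).

    visual_feat: list of C_v channels, each an H x W nested list
                 (stand-in for a video encoder's output)
    audio_feat:  list of C_a numbers (stand-in for an audio encoder's output)
    w_fuse:      C_f rows of (C_v + C_a) weights, i.e. a 1x1 convolution
    w_out:       C_f weights of a 1x1 convolution producing one channel
    Returns an H x W saliency map whose values sum to 1.
    """
    h, w = len(visual_feat[0]), len(visual_feat[0][0])
    logits = []
    for y in range(h):
        for x in range(w):
            # Concatenate the visual channels at this pixel with the audio
            # vector, which is shared (tiled) across all spatial positions.
            pixel = [ch[y][x] for ch in visual_feat] + list(audio_feat)
            # A 1x1 convolution is a per-pixel linear map; ReLU follows.
            hidden = [max(0.0, sum(wi * pi for wi, pi in zip(row, pixel)))
                      for row in w_fuse]
            logits.append(sum(wo * hi for wo, hi in zip(w_out, hidden)))
    # Spatial softmax turns the logits into a distribution over pixels.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    flat = [e / total for e in exps]
    return [flat[y * w:(y + 1) * w] for y in range(h)]
```

Setting the audio columns of `w_fuse` to zero gives a visual-only variant with the same decoder, which is the kind of like-for-like comparison the shared architecture is meant to allow.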

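Experimental Results

Research Questions

RQ1: Does audio information significantly improve dynamic video saliency prediction compared with visual-only models?
RQ2: How does the presence of a visible sound source affect the performance of audio-visual saliency models?
RQ3: To what extent do richer input-level representations (e.g., 3D CNNs pretrained on large-scale video datasets) improve saliency prediction independently of decoder complexity?
RQ4: How does the audio-visual model compare with existing video-only saliency models in human gaze prediction accuracy?
RQ5: Is the model's behavior consistent with human attention patterns, particularly in attending to sound-source locations?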
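Questions such as RQ4 are usually answered by scoring a predicted saliency map against recorded fixations. This digest does not list the metrics the paper uses, so the sketch below shows Normalized Scanpath Saliency (NSS), one metric that is common in the saliency literature, purely as an assumed example. The function name `nss` and the `(row, col)` fixation format are illustrative choices.

```python
import math

def nss(saliency, fixations):
    """Normalized Scanpath Saliency for one frame.

    saliency:  H x W nested list of predicted saliency values
    fixations: list of (row, col) pixel positions where viewers fixated
    The map is standardized to zero mean and unit standard deviation,
    then averaged at the fixated pixels. Higher is better; a value
    near 0 means chance-level prediction.
    """
    if not fixations:
        raise ValueError("at least one fixation is required")
    values = [v for row in saliency for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("saliency map is constant; NSS is undefined")
    return sum((saliency[r][c] - mean) / std for r, c in fixations) / len(fixations)
```

Computing a score like this per frame for both the audio-visual and the visual-only model yields the paired per-frame values that a frame-level comparison needs.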

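Key Findings

Audio is a strong and significant contributor to dynamic saliency prediction: the audio-visual model outperforms the visual-only baseline on 53.54% of frames.
The audio-visual model performs better on all evaluation metrics and across all video categories, showing a consistent advantage over the baseline models.
The model's attention aligns closely with human gaze at visible sound-source locations, indicating that audio helps localize attention to the correct spatial position.
Richer input-level features (e.g., 3D CNNs pretrained on large-scale video datasets) yield better saliency predictions even when the decoder architecture is kept simple.
The audio-visual model is significantly better than the visual-only model at predicting fixations on active sound sources, confirming the role of audio in guiding attention.
The model performs consistently across video types, and the contribution of audio is largest in scenes where the sound source is clearly visible.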
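A frame-level figure such as "improves over 53.54% of the frames" can be read as a win rate over paired per-frame scores. The sketch below shows that arithmetic under the assumption that each frame has one score per model and that higher is better; the function name `frame_win_rate` and the sample numbers are hypothetical and are not data from the paper.

```python
def frame_win_rate(av_scores, visual_scores):
    """Percentage of frames where the audio-visual score is strictly higher.

    av_scores, visual_scores: per-frame metric values for the same frames,
    in the same order, where a higher value means a better prediction.
    Ties count as no improvement.
    """
    if len(av_scores) != len(visual_scores) or not av_scores:
        raise ValueError("need two non-empty score lists of equal length")
    wins = sum(1 for a, v in zip(av_scores, visual_scores) if a > v)
    return 100.0 * wins / len(av_scores)

# Hypothetical scores for four frames (made up for illustration).
rate = frame_win_rate([0.9, 0.5, 0.7, 0.2], [0.8, 0.6, 0.7, 0.1])
print(f"{rate:.2f}% of frames improved")
```

A win rate just above 50% is compatible with a clear average gain when the improvements are concentrated in particular frames, for example those with a visible sound source, which matches the pattern the findings describe.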

This digest was generated by AI and reviewed by a human editor.