QUICK REVIEW

[Paper Review] M$^3$Net: Multilevel, Mixed and Multistage Attention Network for Salient Object Detection

Yuan Yao, Pan Gao|arXiv (Cornell University)|Sep 15, 2023

Visual Attention and Saliency Detection | Cited by 11

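One-sentence summary

Introduces M$^3$Net, which uses a Multilevel Interaction Block and a Mixed Attention Block within a multistage decoder to improve salient object detection, and reports state-of-the-art results on six datasets.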

ABSTRACT

Most existing salient object detection methods use a U-Net or feature pyramid structure, which simply aggregates feature maps of different scales, ignoring their uniqueness and interdependence and their respective contributions to the final prediction. To overcome these limitations, we propose the M$^3$Net, i.e., the Multilevel, Mixed and Multistage attention network for Salient Object Detection (SOD). Firstly, we propose the Multilevel Interaction Block, which introduces the cross-attention approach to achieve interaction between multilevel features, allowing high-level features to guide low-level feature learning and thus enhancing salient regions. Secondly, considering that previous Transformer-based SOD methods locate salient regions using only global self-attention while inevitably overlooking the details of complex objects, we propose the Mixed Attention Block. This block combines global self-attention and window self-attention, aiming to model context at both global and local levels to further improve the accuracy of the prediction map. Finally, we propose a multilevel supervision strategy to optimize the aggregated features stage by stage. Experiments on six challenging datasets demonstrate that the proposed M$^3$Net surpasses recent CNN and Transformer-based SOD methods in terms of four metrics. Codes are available at https://github.com/I2-Multimedia-Lab/M3Net.

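Research motivation and objectives

Rethink how multilevel features can contribute to saliency prediction beyond simple aggregation.
Propose a mechanism that enables cross-level interaction, so that high-level features guide low-level learning.
Address the loss of local detail in Transformer-based SOD by combining global self-attention with windowed self-attention.
Develop a multistage decoder that uses multilevel supervision to refine the saliency map progressively.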
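
To make the third objective concrete, the following is a minimal NumPy sketch of mixing global self-attention with windowed self-attention over a small feature map. It is an illustration under stated assumptions, not the authors' Mixed Attention Block: the attention uses identity projections (no learned weights), a single head, and fuses the two branches by simple addition. The function names `window_partition`, `window_merge`, and `mixed_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head self-attention with identity Q/K/V projections (assumption).
    # tokens: (..., N, C); leading dimensions are treated as a batch.
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(tokens.shape[-1])
    return softmax(scores, axis=-1) @ tokens

def window_partition(x, win):
    # (H, W, C) -> (num_windows, win * win, C); H and W must be divisible by win.
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def window_merge(windows, win, H, W):
    # Inverse of window_partition: (num_windows, win * win, C) -> (H, W, C).
    C = windows.shape[-1]
    x = windows.reshape(H // win, W // win, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def mixed_attention(x, win=2):
    # Global branch: every token attends to every other token.
    H, W, C = x.shape
    global_out = self_attention(x.reshape(H * W, C)).reshape(H, W, C)
    # Local branch: tokens attend only within their own window.
    local_out = window_merge(self_attention(window_partition(x, win)), win, H, W)
    # Additive fusion with a residual connection (assumption, not from the source).
    return x + global_out + local_out

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((4, 4, 3))
mixed = mixed_attention(feature_map, win=2)
```

The global branch captures context across the whole map, while the windowed branch restricts attention to small neighbourhoods, which is the intuition behind preserving fine detail.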

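Proposed method

Introduce the Multilevel Interaction Block (MIB), which applies cross-attention between low-level and high-level features so that high-level cues guide low-level refinement.
Introduce the Mixed Attention Block (MAB), which fuses global self-attention with window self-attention to model both global and local context.
Adopt a multistage decoder that fuses features sequentially without convolution operations, using token-based upsampling (RT2T) with overlapping folds.
Apply multilevel supervision at every decoding stage to optimize the intermediate predictions.
Train a Swin Transformer-based encoder (the backbone can be swapped) together with a U-shaped multiscale decoder that uses cross-scale attention.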
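
The cross-attention idea behind the MIB can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: it assumes queries come from the low-level tokens and keys/values from the high-level tokens, uses a single head, random projection matrices (`w_q`, `w_k`, `w_v` are hypothetical names), and adds the result back to the low-level tokens through a residual connection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(low_tokens, high_tokens, w_q, w_k, w_v):
    # low_tokens: (N_low, C), high_tokens: (N_high, C).
    q = low_tokens @ w_q                      # queries from low-level features
    k = high_tokens @ w_k                     # keys from high-level features
    v = high_tokens @ w_v                     # values from high-level features
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N_low, N_high)
    attn = softmax(scores, axis=-1)           # each low-level token weights all high-level tokens
    return attn @ v, attn

rng = np.random.default_rng(0)
C = 8
low = rng.standard_normal((16, C))    # e.g. a 4x4 grid of low-level tokens
high = rng.standard_normal((4, C))    # e.g. a 2x2 grid of high-level tokens
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

guided, attn = cross_attention(low, high, w_q, w_k, w_v)
refined_low = low + guided            # residual connection (assumption)
```

Because the attention matrix has shape (N_low, N_high), the two feature levels do not need the same spatial resolution, which is what makes cross-attention a natural fit for interaction across levels.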

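Experimental results

Research questions

RQ1: How can multilevel features be used interactively to improve saliency prediction?
RQ2: Can combining global and local attention preserve fine-grained object details in SOD?
RQ3: Does a staged, progressively supervised decoder improve saliency map quality compared with a conventional decoder?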

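Key findings

M$^3$Net surpasses recent CNN and Transformer-based SOD methods on six challenging datasets in terms of four metrics.
The Multilevel Interaction Block effectively enhances salient regions by letting high-level features guide low-level features.
The Mixed Attention Block models global context and local detail together, improving prediction accuracy and detail preservation.
The multistage decoder with multilevel supervision suppresses non-salient information in low-level features while producing accurate saliency maps.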
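
The stage-wise supervision mentioned in the last finding can be illustrated with a small loss function that compares every intermediate prediction against a resized ground-truth mask. This is a hedged sketch: binary cross-entropy is used only as a stand-in because this summary does not state the loss, the maps are assumed square, all stages are weighted equally by default, and `resize_nearest` and `multilevel_loss` are hypothetical helper names.

```python
import numpy as np

def resize_nearest(mask, size):
    # Nearest-neighbour resize of a (H, W) mask to (size, size).
    H, W = mask.shape
    rows = np.arange(size) * H // size
    cols = np.arange(size) * W // size
    return mask[np.ix_(rows, cols)]

def bce(pred, target, eps=1e-7):
    # Mean binary cross-entropy between probabilities and a binary target.
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def multilevel_loss(stage_preds, gt, weights=None):
    # Sum the per-stage losses, resizing the ground truth to each stage's resolution.
    weights = weights or [1.0] * len(stage_preds)
    total = 0.0
    for w, pred in zip(weights, stage_preds):
        total += w * bce(pred, resize_nearest(gt, pred.shape[0]))
    return total

# Toy ground truth: an 8x8 mask with a salient 4x4 square in the centre.
gt = np.zeros((8, 8))
gt[2:6, 2:6] = 1.0

# Three decoder stages at increasing resolution, all predicting 0.5 everywhere,
# so each stage contributes ln(2) to the total loss.
stage_preds = [np.full((s, s), 0.5) for s in (2, 4, 8)]
loss = multilevel_loss(stage_preds, gt)
```

Supervising each stage gives coarse predictions a direct training signal instead of relying only on the gradient from the final, full-resolution output.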

This review was generated by AI and reviewed by a human editor.