QUICK REVIEW

[Paper Review] Visual Attention Methods in Deep Learning: An In-Depth Survey

Mohammed Hassanin, Saeed Anwar|arXiv (Cornell University)|Apr 16, 2022

Visual Attention and Saliency Detection|Cited by 56

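One-Sentence Summary

This survey reviews 50 deep learning attention techniques for vision, categorizes them, and discusses their building blocks, strengths, and limitations.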

ABSTRACT

Inspired by the human cognitive system, attention is a mechanism that imitates the human cognitive awareness about specific information, amplifying critical details to focus more on the essential aspects of data. Deep learning has employed attention to boost performance for many applications. Interestingly, the same attention design can suit processing different data modalities and can easily be incorporated into large networks. Furthermore, multiple complementary attention mechanisms can be incorporated into one network. Hence, attention techniques have become extremely attractive. However, the literature lacks a comprehensive survey on attention techniques to guide researchers in employing attention in their deep models. Note that, besides being demanding in terms of training data and computational resources, transformers only cover a single category in self-attention out of the many categories available. We fill this gap and provide an in-depth survey of 50 attention techniques, categorizing them by their most prominent features. We initiate our discussion by introducing the fundamental concepts behind the success of the attention mechanism. Next, we furnish some essentials such as the strengths and limitations of each attention category, describe their fundamental building blocks, basic formulations with primary usage, and applications specifically for computer vision. We also discuss the challenges and general open questions related to attention mechanisms. Finally, we recommend possible future research directions for deep attention. All the information about visual attention methods in deep learning is provided at https://github.com/saeed-anwar/VisualAttention

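Motivation and Objectives

Encourage researchers to understand the broad spectrum of vision-oriented attention mechanisms beyond transformers.
Provide a unified taxonomy of attention techniques (soft, hard, multi-modal, arithmetic, logical, and others) and map it to core building blocks.
Summarize the fundamental concepts, strengths and limitations, and primary uses of attention modules in computer vision.
Highlight the challenges, gaps, and future research directions for deep attention in vision.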

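Proposed Approach

Groups attention mechanisms into dominant categories: soft (deterministic) attention, hard (stochastic) attention, multi-modal, arithmetic, logical, and self-learning methods.
Describes the core building blocks and basic formulations (e.g., channel attention, spatial attention, self-attention), with representative examples (SE, CBAM, ECA, DAN, A2-Nets, and others).
Explains how attention scores are computed (e.g., via softmax, sigmoid, pooling, or frequency components) and how the attended features are integrated.
Discusses transformer-based self-attention and its role as one category among the many attention types in vision.
Discusses architectural and computational considerations, including memory/compute trade-offs and suitability for different vision tasks.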
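
As a rough illustration of two of the building blocks named above, the sketch below computes attention scores with a sigmoid (SE-style channel attention) and with a softmax (scaled dot-product self-attention), then integrates the attended features. It is a minimal pure-Python sketch, not the survey's or any library's implementation: the function names, the nested-list layout, and the untrained weight matrices `w_reduce` and `w_expand` are assumptions made for illustration, and biases and learned query/key/value projections are omitted.

```python
import math


def sigmoid(x):
    """Logistic function; maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def channel_attention(feature_map, w_reduce, w_expand):
    """SE-style channel attention on a feature map shaped [C][H][W].

    w_reduce and w_expand are hypothetical, untrained weights standing in
    for the two learned fully connected layers (biases omitted).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: bottleneck with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, h) for h in matvec(w_reduce, squeezed)]
    gates = [sigmoid(g) for g in matvec(w_expand, hidden)]
    # Integration: rescale every channel by its gate.
    attended = [
        [[gate * value for value in row] for row in channel]
        for gate, channel in zip(gates, feature_map)
    ]
    return attended, gates


def self_attention(queries, keys, values):
    """Scaled dot-product self-attention over N tokens of dimension d."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        # Score each key against the query, then normalize with softmax.
        row = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        weights.append(row)
        # Integration: weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(row, values))
            for j in range(len(values[0]))
        ])
    return outputs, weights
```

Calling `channel_attention` returns one gate per channel, while `self_attention` returns one weight per pair of tokens. That difference in score count is one way to see the memory/compute trade-offs mentioned in the last item.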

Figure 1: Visual charts show the increase in the number of attention-related papers in the top conferences including CVPR, ICCV, ECCV, NeurIPS, ICML, and ICLR.

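Results

Research Questions

RQ1: Which dominant categories of attention mechanisms are used in deep learning for vision?
RQ2: What are the strengths, limitations, and core building blocks of each attention category?
RQ3: How do attention techniques affect common computer vision tasks such as recognition, segmentation, and detection?
RQ4: What challenges and open problems does deep attention face when methods beyond transformers are applied to vision?
RQ5: Which future research directions could advance deep attention methods in vision?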

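Key Findings

Attention mechanisms in vision are diverse and fall into many categories beyond self-attention and transformers.
Channel attention, spatial attention, and self-attention form the core soft attention subtypes, each with its own strengths and limitations.
Transformer-based self-attention represents only a subset of the 50 attention techniques covered and is costly in both computation and data.
Hybrid and multi-branch attention modules (such as A2-Nets, DAN, and Harmonious Attention) can capture higher-order or cross-feature interactions.
Notable design trends include the use of second-order statistics, frequency-domain components, and self-learning architectures to enhance attention.
The survey identifies research gaps and proposes future directions toward robust, efficient, and generalizable deep attention in vision.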
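
The computational cost of self-attention noted above can be made concrete with a small counting sketch. It assumes dense self-attention in which every spatial position attends to every other position, so the number of scores grows with the square of the token count, whereas channel attention needs only one gate per channel. The function names, grid sizes, and the 256-channel figure are illustrative assumptions, not numbers from the survey.

```python
def self_attention_score_count(height, width, heads=1):
    """Pairwise scores for dense self-attention over an H x W grid."""
    tokens = height * width
    return heads * tokens * tokens


def channel_attention_score_count(channels):
    """One gate per channel, independent of spatial resolution."""
    return channels


# Compare score counts as the (hypothetical) feature-map side length grows.
for side in (14, 28, 56):
    print(side,
          self_attention_score_count(side, side),
          channel_attention_score_count(256))
```

Under these assumptions, doubling the side length of the grid quadruples the token count and therefore multiplies the self-attention score count by 16, while the channel attention count stays fixed.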

Figure 3: Core structures of the channel-based attention methods. Different methods to generate the attention scores including squeeze and excitation [26], splitting and squeezing [23], calculating the second order [37] or efficient squeezing and excitation [22]. Images are taken from the

This review was generated by AI and reviewed by human editors.