QUICK REVIEW

[Paper Review] An Empirical Study of Spatial Attention Mechanisms in Deep Networks

Xizhou Zhu, Dazhi Cheng | arXiv (Cornell University) | Apr 11, 2019
Advanced Neural Network Applications | 50 references | 102 citations
TL;DR

This paper conducts a comprehensive ablation study of spatial attention mechanisms across Transformer attention, deformable convolution, and dynamic convolution, revealing surprising roles for query content and key content factors in self- versus encoder-decoder attention.

ABSTRACT

Attention mechanisms have become a popular component in deep neural networks, yet there has been little examination of how different influencing factors and methods for computing attention from these factors affect performance. Toward a better general understanding of attention mechanisms, we present an empirical study that ablates various spatial attention elements within a generalized attention formulation, encompassing the dominant Transformer attention as well as the prevalent deformable convolution and dynamic convolution modules. Conducted on a variety of applications, the study yields significant findings about spatial attention in deep networks, some of which run counter to conventional understanding. For example, we find that the query and key content comparison in Transformer attention is negligible for self-attention, but vital for encoder-decoder attention. A proper combination of deformable convolution with key content only saliency achieves the best accuracy-efficiency tradeoff in self-attention. Our results suggest that there exists much room for improvement in the design of attention mechanisms.

Motivation & Objective

  • Clarify how different attention factors (query content, key content, relative position) affect performance across NLP and vision tasks.
  • Unify Transformer attention, deformable convolution, and dynamic convolution under a generalized spatial attention framework.
  • Identify which attention components are crucial for self-attention versus encoder-decoder attention.
  • Evaluate accuracy-efficiency tradeoffs of attention module variants in object detection, semantic segmentation, and neural machine translation.

Proposed method

  • Propose a generalized multi-head attention formulation that encompasses Transformer attention, regular/deformable convolution, and dynamic convolution (Eq. 1).
  • Decompose Transformer attention into four terms (E1–E4) corresponding to query/key content, query content with relative position, key content, and relative position.
  • Perform ablations by selectively activating terms via beta parameters to study their impact on performance and efficiency (Eq. 8).
  • Incorporate attention modules into backbones for object detection and segmentation and into Transformer-based NMT models to compare accuracy and FLOPs across tasks.
  • Compare deformable convolution and dynamic convolution against Transformer attention by aligning their factor usage within the unified framework.
  • Use standard benchmarks: COCO for object detection, Cityscapes for semantic segmentation, and WMT14 English–German for NMT.
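
For reference, the generalized formulation and the four-term decomposition listed above can be written roughly as follows. This is a sketch in the style of relative-position attention; the symbol names ($z_q$, $x_k$, $U_m$, $V_m^{C}$, $V_m^{R}$, $u_m$, $v_m$, $R_{k-q}$, $\beta_j$) are assumptions for illustration and may differ from the paper's exact notation.

```latex
% Generalized multi-head attention: the output at query q aggregates
% the keys k in its support region \Omega_q over M heads.
\begin{align*}
y_q &= \sum_{m=1}^{M} W_m \Big[ \sum_{k \in \Omega_q} A_m(q, k, z_q, x_k) \odot W'_m x_k \Big] \\
% Transformer attention weight: normalized exponential of four terms,
% each switched on or off by \beta_j \in \{0, 1\} in the ablation.
A_m(q, k, z_q, x_k) &\propto \exp\Big( \sum_{j=1}^{4} \beta_j E_j \Big),
  \qquad \sum_{k \in \Omega_q} A_m(q, k, z_q, x_k) = 1 \\
E_1 &= z_q^\top U_m^\top V_m^{C} x_k     && \text{(query and key content)} \\
E_2 &= z_q^\top U_m^\top V_m^{R} R_{k-q} && \text{(query content and relative position)} \\
E_3 &= u_m^\top V_m^{C} x_k              && \text{(key content only)} \\
E_4 &= v_m^\top V_m^{R} R_{k-q}          && \text{(relative position only)}
\end{align*}
```

Setting all $\beta_j = 1$ recovers the full four-term attention weight, while setting a single $\beta_j = 1$ isolates one factor.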

Experimental results

Research questions

  • RQ1: What is the measured impact of each attention factor (query content, key content, relative position) on performance in self-attention vs encoder-decoder attention?
  • RQ2: Can deformable convolution or dynamic convolution achieve better accuracy-efficiency tradeoffs than standard Transformer attention for vision tasks?
  • RQ3: How does combining deformable convolution with key content saliency affect accuracy and efficiency in self-attention?
  • RQ4: Are non-query-sensitive attention terms (key content, relative position) essential for high performance in certain settings?
  • RQ5: What general guidelines emerge for designing spatial attention mechanisms across NLP and vision applications?

Key findings

  • In Transformer attention, the query-sensitive terms (especially the query and key content term) play a minor role in self-attention but are vital for encoder-decoder attention.
  • A proper combination of deformable convolution with the key content only term yields the best accuracy-efficiency tradeoff in self-attention for image recognition.
  • In self-attention, the two most important factors are the query content & relative position term and the key content only term; the ablations show that the choice of enabled terms has a substantial effect on performance.
  • Modules with only query-sensitive terms can perform comparably to those using query-irrelevant terms, suggesting design issues rather than intrinsic properties of self-attention.
  • Deformable convolution operates effectively by leveraging query content and relative position and can outperform Transformer attention in image recognition when paired appropriately with key-content cues.
  • Overall, the study reveals there is substantial room for improvement in spatial attention design beyond conventional query-centric intuition.
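
To make the term ablation behind these findings concrete, here is a minimal NumPy sketch of a single attention head whose four terms are switched on or off by 0/1 flags. It is an illustration under stated assumptions, not the authors' implementation: the function name `ablated_attention`, the tensor shapes, the 1-D sequence layout, and the random relative-position table are all hypothetical.

```python
import numpy as np

def ablated_attention(z, x, betas, params, rel_emb):
    """Attention weights A[q, k] proportional to exp(sum_j beta_j * E_j[q, k]).

    z: (Nq, d) query content, x: (Nk, d) key content,
    rel_emb: (Nq, Nk, d) relative position encoding for each (q, k) pair,
    betas: four 0/1 flags, params: (U, Vc, Vr, u, v) projection weights.
    All names and shapes are assumptions made for this sketch.
    """
    U, Vc, Vr, u, v = params
    qz = z @ U.T         # (Nq, dh) projected query content
    kc = x @ Vc.T        # (Nk, dh) projected key content
    kr = rel_emb @ Vr.T  # (Nq, Nk, dh) projected relative position
    e1 = qz @ kc.T                           # query and key content
    e2 = np.einsum("qd,qkd->qk", qz, kr)     # query content and relative position
    e3 = np.broadcast_to(kc @ u, e1.shape)   # key content only
    e4 = kr @ v                              # relative position only
    logits = sum(b * e for b, e in zip(betas, (e1, e2, e3, e4)))
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)  # normalize over keys

rng = np.random.default_rng(0)
N, d, dh = 5, 8, 4
x = rng.normal(size=(N, d))
z = x  # self-attention: queries and keys share the same content
# Hypothetical relative-position table indexed by the offset k - q.
table = rng.normal(size=(2 * N - 1, d))
offsets = np.arange(N)[None, :] - np.arange(N)[:, None] + N - 1
rel_emb = table[offsets]  # (N, N, d)
params = (
    rng.normal(size=(dh, d)),  # U
    rng.normal(size=(dh, d)),  # Vc
    rng.normal(size=(dh, d)),  # Vr
    rng.normal(size=dh),       # u
    rng.normal(size=dh),       # v
)

full = ablated_attention(z, x, (1, 1, 1, 1), params, rel_emb)
key_only = ablated_attention(z, x, (0, 0, 1, 0), params, rel_emb)
```

With only the key content term enabled, every row of `key_only` is identical, so the module reduces to a query-independent saliency map over the keys. This is the sense in which the key content only term is not query-sensitive.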

This review was created by AI and reviewed by human editors.