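[Paper Review] Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks
Proposes external attention, which uses two external memories implemented as small learnable linear layers to reach linear complexity, and achieves competitive results on visual tasks, including with an all-MLP variant named EAMLP.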
Attention mechanisms, especially self-attention, have played an increasingly important role in deep feature representation for visual tasks. Self-attention updates the feature at each position by computing a weighted sum of features using pair-wise affinities across all positions to capture the long-range dependency within a single sample. However, self-attention has quadratic complexity and ignores potential correlation between different samples. This paper proposes a novel attention mechanism which we call external attention, based on two external, small, learnable, shared memories, which can be implemented easily by simply using two cascaded linear layers and two normalization layers; it conveniently replaces self-attention in existing popular architectures. External attention has linear complexity and implicitly considers the correlations between all data samples. We further incorporate the multi-head mechanism into external attention to provide an all-MLP architecture, external attention MLP (EAMLP), for image classification. Extensive experiments on image classification, object detection, semantic segmentation, instance segmentation, image generation, and point cloud analysis reveal that our method provides results comparable or superior to the self-attention mechanism and some of its variants, with much lower computational and memory costs.
Motivation and Objectives
- Address two limitations of self-attention in visual tasks: its quadratic complexity and its restriction to correlations within a single sample.
- Introduce external attention with small shared memory units to capture dataset-level correlations.
- Show that external attention can replace self-attention in popular architectures with lower computation and memory costs.
- Demonstrate the versatility of external attention across image classification, detection, segmentation, generation, and 3D point cloud tasks.
- Propose multi-head external attention and use it to build an all-MLP architecture, EAMLP, with competitive performance.
Proposed Method
- Define external attention using two external memory units (M_k and M_v) as key and value memories.
- Compute attention as A = Norm(F M_k^T) and F_out = A M_v with linear layers implementing M_k and M_v.
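- Use double normalization (a softmax along one dimension of the attention map followed by an L1 normalization along the other) so that attention scores are less sensitive to the scale of the input features.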
- Extend to multi-head external attention for richer representations.
- Incorporate external attention into existing architectures and build an all-MLP model (EAMLP).
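The steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' code: it assumes F has shape (N, d) and that M_k and M_v are (S, d) matrices standing in for the two bias-free linear layers. The function name `external_attention`, the random initialization, and the `eps` guard are assumptions made for the example; in a real model the memories would be learned parameters.

```python
import numpy as np

def external_attention(F, M_k, M_v, eps=1e-9):
    """Single-head external attention sketch.

    F:   (N, d) input features, one row per position.
    M_k: (S, d) key memory; M_v: (S, d) value memory (hypothetical stand-ins
         for the two learnable linear layers).
    Returns the (N, d) output features and the (N, S) attention map.
    """
    logits = F @ M_k.T                                   # (N, S) affinities
    # Double normalization, step 1: softmax over positions (axis 0),
    # computed separately for each memory slot. Subtracting the column
    # maximum is only for numerical stability.
    logits = logits - logits.max(axis=0, keepdims=True)
    A = np.exp(logits)
    A = A / A.sum(axis=0, keepdims=True)
    # Double normalization, step 2: L1-normalize over memory slots (axis 1)
    # so that each position's weights sum to 1.
    A = A / (A.sum(axis=1, keepdims=True) + eps)
    return A @ M_v, A                                    # F_out = A M_v

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
N, d, S = 196, 32, 64
F = rng.standard_normal((N, d))
M_k = 0.1 * rng.standard_normal((S, d))   # random here; learned in practice
M_v = 0.1 * rng.standard_normal((S, d))
F_out, A = external_attention(F, M_k, M_v)
print(F_out.shape, A.shape)
```

Because M_k and M_v do not depend on the input, the same memories are applied to every sample, which is how the mechanism can pick up dataset-level correlations.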
Experimental Results
Research Questions
- RQ1: Can external attention replace self-attention in vision architectures with linear computational cost?
- RQ2: Does incorporating dataset-level external memories improve generalization and performance across diverse vision tasks?
- RQ3: How does multi-head external attention (MEA) compare to self-attention and other attention variants in accuracy and efficiency?
- RQ4: Can external attention enable an all-MLP vision model that matches CNN/Transformer performance on ImageNet?
- RQ5: What is the impact of normalization strategies on external attention stability and performance?
Key Findings
- External attention achieves comparable or superior results to self-attention across tasks with lower compute and memory usage.
- Using small shared memories (e.g., S ~ 64) yields complexity O(dSN), which is linear in the number of positions N; here d is the feature dimension and S is the memory size.
- Multi-head external attention enables an all-MLP architecture (EAMLP) with competitive ImageNet accuracy (up to 79.4% Top-1 in reported setup).
- Replacing self-attention with external attention improves segmentation and detection metrics in several benchmarks (e.g., VOC, COCO) when integrated into backbone networks.
- External attention provides interpretable attention maps showing focus on meaningful objects and regions, with heads attending to different regions.
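To make the complexity claim concrete, the following back-of-the-envelope sketch compares rough multiply-accumulate counts for the two matrix products in each mechanism. It is an assumption-laden estimate rather than a measurement from the paper: it ignores projections and normalization, and the helper names and the value d = 512 are hypothetical.

```python
def self_attention_cost(n, d):
    # Affinity matrix (N x N) plus weighted sum over N positions:
    # roughly 2 * N^2 * d multiply-accumulates.
    return 2 * n * n * d

def external_attention_cost(n, d, s):
    # F M_k^T (N x S) plus A M_v (N x d):
    # roughly 2 * N * S * d multiply-accumulates.
    return 2 * n * s * d

d, s = 512, 64  # d is a hypothetical feature dimension; S = 64 as in the text
for side in (32, 64, 128):
    n = side * side  # number of positions in a side x side feature map
    ratio = self_attention_cost(n, d) / external_attention_cost(n, d, s)
    print(f"N={n}: self-attention needs about {ratio:.0f}x the operations")
```

Under these assumptions the ratio is simply N / S, so the gap widens as the feature map grows while the external-attention cost grows only in proportion to N.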
This review was generated by AI and checked by a human editor.