QUICK REVIEW

[Paper Review] Glance-and-Gaze Vision Transformer

Qihang Yu, Yingda Xia|arXiv (Cornell University)|Jun 4, 2021
Visual Attention and Saliency Detection | References: 46 | Cited by: 33
One-Sentence Summary

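GG-Transformer introduces parallel Glance and Gaze branches that let vision Transformers model long-range dependencies and local context efficiently, yielding better accuracy-cost trade-offs on ImageNet, ADE20K, and COCO. The Glance branch uses adaptively-dilated self-attention (G-MSA), and the Gaze branch adds locality through a depth-wise convolution.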

ABSTRACT

Recently, there emerges a series of vision Transformers, which show superior performance with a more compact model size than conventional convolutional neural networks, thanks to the strong ability of Transformers to model long-range dependencies. However, the advantages of vision Transformers also come with a price: Self-attention, the core part of Transformer, has a quadratic complexity to the input sequence length. This leads to a dramatic increase of computation and memory cost with the increase of sequence length, thus introducing difficulties when applying Transformers to the vision tasks that require dense predictions based on high-resolution feature maps. In this paper, we propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer), to address the aforementioned issues. It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes, with the ability to efficiently model both long-range dependencies and local context. In GG-Transformer, the Glance and Gaze behavior is realized by two parallel branches: The Glance branch is achieved by performing self-attention on the adaptively-dilated partitions of the input, which leads to a linear complexity while still enjoying a global receptive field; The Gaze branch is implemented by a simple depth-wise convolutional layer, which compensates local image context to the features obtained by the Glance mechanism. We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers on various vision tasks and benchmarks. The codes and models will be made available at https://github.com/yucornetto/GG-Transformer.
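The quadratic-versus-linear contrast in the abstract can be made concrete with a rough operation count. The sketch below assumes the commonly used estimate for global multi-head self-attention, 4NC^2 + 2N^2C, and the G-MSA estimate quoted under the proposed method in this review, 4NC^2 + 2M^2NC. The function names and the sample values (a 56×56 token grid, C = 96, M = 7) are illustrative assumptions, not settings confirmed by this review.

```python
def msa_cost(n_tokens, channels):
    """Estimated cost of global self-attention: 4NC^2 + 2N^2C."""
    return 4 * n_tokens * channels**2 + 2 * n_tokens**2 * channels


def g_msa_cost(n_tokens, channels, m):
    """Estimated cost of partitioned attention: 4NC^2 + 2M^2NC,
    where each partition holds m * m tokens."""
    return 4 * n_tokens * channels**2 + 2 * m**2 * n_tokens * channels


if __name__ == "__main__":
    channels, m = 96, 7            # illustrative values only
    for side in (56, 112, 224):    # token grid is side x side
        n = side * side
        ratio = msa_cost(n, channels) / g_msa_cost(n, channels, m)
        print(f"N={n:6d}  MSA={msa_cost(n, channels):.3e}  "
              f"G-MSA={g_msa_cost(n, channels, m):.3e}  ratio={ratio:.1f}x")
```

With M fixed, doubling N doubles the G-MSA estimate, while the global MSA estimate grows faster than that because of the N^2 term. The two estimates coincide when a single partition covers the whole input (M^2 = N).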

Motivation and Objectives

  • Motivate efficient Transformer design for high-resolution vision tasks requiring dense predictions.
  • Propose a Glance-and-Gaze Transformer block combining long-range attention with local detail via parallel branches.
  • Show that GG-Transformer achieves superior accuracy-cost trade-offs on ImageNet, ADE20K, and COCO compared to prior Transformers.

Proposed Method

  • Glance branch: self-attention on adaptively-dilated partitions to preserve global receptive field with linear complexity.
  • Gaze branch: depthwise convolution to compensate local context on merged values.
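  • GG-MSA: merge-and-attend within partitions to maintain a global view at reduced cost: Ω(G-MSA) = 4NC^2 + 2M^2NC for N tokens, C channels, and M×M tokens per partition, versus Ω(MSA) = 4NC^2 + 2N^2C for global self-attention.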
  • Gaze branch options: fixed or adaptive kernel sizes to compensate local features (adaptive recommended).
  • Fully parallel GG-Transformer blocks built into a hierarchical backbone with four stages similar to Swin-Transformer for fair comparison.
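The adaptively-dilated partitioning behind the Glance branch can be illustrated with plain index arithmetic. This is a minimal sketch, not the authors' implementation: it assumes each partition holds M×M tokens sampled with a stride of (H/M, W/M), so the dilation adapts to the input size. The helper names `dilated_partitions`, `window_partitions`, and `extent` are hypothetical, and contiguous windows are included only for contrast.

```python
def dilated_partitions(height, width, m):
    """Group (row, col) positions of a height x width token grid into
    partitions of m x m tokens, sampled with stride (height/m, width/m)."""
    if height % m or width % m:
        raise ValueError("height and width must be divisible by m")
    dil_h, dil_w = height // m, width // m  # dilation adapts to input size
    return {
        (i, j): [(i + a * dil_h, j + b * dil_w)
                 for a in range(m) for b in range(m)]
        for i in range(dil_h) for j in range(dil_w)
    }


def window_partitions(height, width, m):
    """Contiguous m x m windows, shown only for comparison."""
    if height % m or width % m:
        raise ValueError("height and width must be divisible by m")
    return {
        (i, j): [(i * m + a, j * m + b)
                 for a in range(m) for b in range(m)]
        for i in range(height // m) for j in range(width // m)
    }


def extent(positions):
    """Rows and columns spanned by a set of positions."""
    rows = [r for r, _ in positions]
    cols = [c for _, c in positions]
    return max(rows) - min(rows) + 1, max(cols) - min(cols) + 1


if __name__ == "__main__":
    dilated = dilated_partitions(8, 8, 4)
    windows = window_partitions(8, 8, 4)
    print("dilated partition (0, 0):", dilated[(0, 0)][:4], "...")
    print("dilated extent:", extent(dilated[(0, 0)]))
    print("window extent: ", extent(windows[(0, 0)]))
```

Both schemes attend over the same number of tokens per partition, so the attention cost is identical. The dilated partition spans nearly the whole grid, while the contiguous window stays local, which is why a separate Gaze branch is used to restore local context.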

Experimental Results

Research Questions

  • RQ1: Can Glance-and-Gaze (GG) blocks provide global long-range modeling without quadratic cost while preserving local detail?
  • RQ2: Do GG-Transformer blocks improve accuracy on ImageNet, ADE20K, and COCO relative to Swin-Transformer and other ViTs at similar model sizes?
  • RQ3: How do the Glance and Gaze components contribute to performance, and is the combination superior to either alone?
  • RQ4: Is GG-MSA a viable drop-in replacement for MSA in existing ViT architectures like DeiT?

Key Findings

  • GG-Transformer achieves higher accuracy than comparable vision Transformers with similar FLOPs and parameters on ImageNet.
  • GG-T and GG-S match or exceed Swin-T/S in top-1 accuracy at the same model size and computation cost, with GG-T at 82.0% and GG-S at 83.4% on ImageNet (224^2).
  • On ADE20K, GG-T achieves 46.4% mIoU (single-scale) and 47.2% with test-time augmentation, outperforming ResNet50, PVT-Small, and Swin-T baselines; GG-S also surpasses Swin-S in mIoU (48.4%/49.6%).
  • On COCO object detection, GG-T and GG-S backbones achieve higher APs than CNN and ViT backbones of similar sizes; GG-T yields 44.1 AP^b and 39.9 AP^m with Mask R-CNN/Cascade Mask R-CNN setups, outperforming Swin-T at comparable costs.
  • Ablations show that Glance+Gaze with a convolution-based Gaze branch outperforms both MSA-only and Swin-style local-window variants, reaching 80.28% top-1 on ImageNet in the Swin-T baseline setting.
  • GG-MSA can improve DeiT backbones as well (GG-DeiT-T 73.8%, GG-DeiT-S 80.5%), demonstrating versatility beyond Swin-like architectures.

This review was generated by AI and reviewed by human editors.