QUICK REVIEW

[Paper Review] SpectFormer: Frequency and Attention is what you need in a Vision Transformer

Badri N. Patro, Vinay P. Namboodiri|arXiv (Cornell University)|Apr 13, 2023
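Advanced Neural Network Applications | Cited by 53
One-sentence summary

SpectFormer combines Fourier-based spectral layers with multi-headed self-attention in the later stages to improve vision transformer performance, reaching state-of-the-art results for the small and base-comparable variants on ImageNet-1K and strong results on transfer-learning and COCO tasks.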

ABSTRACT

Vision transformers have been applied successfully for image recognition tasks. These have been either multi-headed self-attention based (ViT \cite{dosovitskiy2020image}, DeiT \cite{touvron2021training}), similar to the original work in textual models, or, more recently, based on spectral layers (FNet \cite{lee2021fnet}, GFNet \cite{rao2021global}, AFNO \cite{guibas2021efficient}). We hypothesize that both spectral and multi-headed attention play a major role. We investigate this hypothesis through this work and observe that combining spectral and multi-headed attention layers indeed provides a better transformer architecture. We thus propose the novel SpectFormer architecture for transformers, which combines spectral and multi-headed attention layers. We believe that the resulting representation allows the transformer to capture the feature representation appropriately, and it yields improved performance over other transformer representations. For instance, it improves the top-1 accuracy by 2% on ImageNet compared to both GFNet-H and LiT. SpectFormer-S reaches 84.25% top-1 accuracy on ImageNet-1K (state of the art for the small version). Further, SpectFormer-L achieves 85.7%, which is the state of the art for the comparable base version of the transformers. We further ensure that we obtain reasonable results in other scenarios, such as transfer learning on standard datasets including CIFAR-10, CIFAR-100, Oxford-IIIT-flower, and Stanford Cars. We then investigate its use in downstream tasks such as object detection and instance segmentation on the MS-COCO dataset and observe that SpectFormer shows consistent performance that is comparable to the best backbones and can be further optimized and improved. Hence, we believe that combined spectral and attention layers are what are needed for vision transformers.

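Motivation and Objectives

  • Show why image representation benefits from combining spectral and attention-based token mixing.
  • Design a unified SpectFormer architecture that uses spectral layers in the early stages and attention layers in the later stages.
  • Validate the design empirically on ImageNet and downstream tasks against baseline, spectral, and hierarchical transformers.
  • Demonstrate SpectFormer's gains in transfer learning and in object detection/segmentation.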

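Proposed Method

  • Introduce SpectFormer, built from two kinds of transformer block: spectral layers (FFT-based gating) that capture local frequency information, followed by multi-headed self-attention layers for global features.
  • Use a tunable alpha to control the ratio of spectral layers to attention layers across the blocks.
  • Use a patch embedding layer, positional encoding, and a classification head in a standard ViT-style pipeline.
  • Try several spectral variants (FN, FGN, FNO, WGN) and find the Fourier Gating Network (FGN) the most effective.
  • Compare vanilla and hierarchical SpectFormer against DeiT, GFNet, AFNO, LiT, Swin, and PVT on ImageNet-1K and the transfer datasets.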
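
The spectral-then-attention layout described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that "FFT-based gating" means an elementwise product between the 2D FFT of the token grid and a learnable complex weight (the global-filter formulation), and it uses a single-head attention without projections, MLPs, or normalization to stay short. The names `spectral_gating`, `self_attention`, and `spectformer_like_forward` are hypothetical.

```python
import numpy as np

def spectral_gating(x, weight):
    """Mix tokens in the frequency domain (assumed form of the gating).

    x:      real array of shape (H, W, C), a grid of patch tokens.
    weight: complex array of shape (H, W // 2 + 1, C), a hypothetical
            learnable gate applied per frequency and channel.
    """
    h, w, _ = x.shape
    freq = np.fft.rfft2(x, axes=(0, 1), norm="ortho")  # 2D FFT over the spatial axes
    freq = freq * weight                               # elementwise gating
    return np.fft.irfft2(freq, s=(h, w), axes=(0, 1), norm="ortho")

def self_attention(x):
    """Single-head self-attention with no learned projections (simplification)."""
    h, w, c = x.shape
    tokens = x.reshape(h * w, c)
    scores = tokens @ tokens.T / np.sqrt(c)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return (probs @ tokens).reshape(h, w, c)

def spectformer_like_forward(x, gates, depth):
    """Run len(gates) spectral blocks first, then attention blocks up to depth."""
    for i in range(depth):
        if i < len(gates):
            x = x + spectral_gating(x, gates[i])  # early blocks: spectral mixing
        else:
            x = x + self_attention(x)             # later blocks: attention mixing
    return x

# Illustrative sizes only: a 14x14 token grid with 8 channels, 2 spectral + 2 attention blocks.
rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 8))
gates = [np.ones((14, 8, 8), dtype=complex) for _ in range(2)]
out = spectformer_like_forward(x, gates, depth=4)
```

With an all-ones gate the spectral layer reduces to the identity, which is a convenient sanity check before any weights are trained.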

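Experimental Results

Research Questions

  • RQ1: Does mixing spectral layers with multi-headed attention outperform spectral-only or attention-only vision transformers?
  • RQ2: What allocation between spectral and attention layers (alpha) gives the best ImageNet performance?
  • RQ3: How does SpectFormer compare with the baselines in transfer-learning settings (CIFAR, Flowers, Cars)?
  • RQ4: Are the SpectFormer variants effective on downstream tasks such as object detection and instance segmentation on MS COCO?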
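
To make the search space behind RQ2 concrete, the following sketch enumerates layer schedules for different alpha values. It assumes that alpha counts the initial spectral blocks and that all remaining blocks use attention; the depth of 12 and the function name `layer_schedule` are illustrative choices, not values taken from the review.

```python
def layer_schedule(depth, alpha):
    """Return the block types for a stack with `alpha` leading spectral blocks.

    Assumption: alpha is an integer count of spectral blocks placed first,
    and the remaining depth - alpha blocks use attention.
    """
    if not 0 <= alpha <= depth:
        raise ValueError("alpha must be between 0 and depth")
    return ["spectral"] * alpha + ["attention"] * (depth - alpha)

# Sweep a few allocations for a hypothetical 12-block model.
for alpha in (0, 4, 12):
    schedule = layer_schedule(12, alpha)
    print(alpha, schedule.count("spectral"), schedule.count("attention"))
```

Under this assumption, alpha = 0 gives the attention-only stack and alpha = depth gives the spectral-only stack that RQ1 compares against, with the mixed designs in between.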

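Key Findings

  • SpectFormer-S reaches 84.25% top-1 accuracy on ImageNet-1K among small variants.
  • SpectFormer-L reaches 85.7% top-1 accuracy on ImageNet-1K among models comparable to the base size.
  • SpectFormer outperforms GFNet, AFNO, LiT, and DeiT at every model size, and the hierarchical variant reaches state-of-the-art results.
  • Among the spectral block variants FN, FNO, WGN, and FGN, the Fourier Gating Network (FGN) gives the best ablation results.
  • SpectFormer shows consistent gains in transfer learning on CIFAR-10, CIFAR-100, Flowers, and Cars, and is competitive on MS COCO object detection and segmentation.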

This review was generated by AI and reviewed by a human editor.