QUICK REVIEW

[Paper Review] DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition

Jiayu Jiao, Yuming Tang|arXiv (Cornell University)|Feb 3, 2023

Advanced Neural Network Applications | Cited by 12

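One-Sentence Summary

DilateFormer introduces a multi-scale dilated Transformer built on Sliding Window Dilated Attention (SWDA) and Multi-Scale Dilated Attention (MSDA), achieving strong results on ImageNet, COCO, and ADE20K with substantially fewer FLOPs and outperforming prior SOTA Vision Transformers.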

ABSTRACT

As a de facto solution, the vanilla Vision Transformers (ViTs) are encouraged to model long-range dependencies between arbitrary image patches while the global attended receptive field leads to quadratic computational cost. Another branch of Vision Transformers exploits local attention inspired by CNNs, which only models the interactions between patches in small neighborhoods. Although such a solution reduces the computational cost, it naturally suffers from small attended receptive fields, which may limit the performance. In this work, we explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field. By analyzing the patch interaction of global attention in ViTs, we observe two key properties in the shallow layers, namely locality and sparsity, indicating the redundancy of global dependency modeling in shallow layers of ViTs. Accordingly, we propose Multi-Scale Dilated Attention (MSDA) to model local and sparse patch interaction within the sliding window. With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages. Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks. On ImageNet-1K classification task, DilateFormer achieves comparable performance with 70% fewer FLOPs compared with existing state-of-the-art models. Our DilateFormer-Base achieves 85.6% top-1 accuracy on ImageNet-1K classification task, 53.5% box mAP/46.1% mask mAP on COCO object detection/instance segmentation task and 51.1% MS mIoU on ADE20K semantic segmentation task.

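Motivation and Objectives

Reduce the quadratic complexity and redundancy of global self-attention in Vision Transformers while maintaining or improving performance.
Analyze patch interactions in the shallow layers to reveal locality and sparsity, which then guide the design of an efficient attention mechanism.
Propose Sliding Window Dilated Attention (SWDA) and Multi-Scale Dilated Attention (MSDA) to model local and multi-scale patch dependencies within a pyramid-structured Transformer.
Build the DilateFormer backbone by using MSDA in the shallow stages and multi-head self-attention (MHSA) in the deeper stages, and evaluate it on classification, detection, and segmentation tasks.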

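Proposed Method

Sliding Window Dilated Attention (SWDA) performs self-attention over sparsely selected patches inside a dilated sliding window centered on each query patch.
Multi-Scale Dilated Attention (MSDA) splits the channels into heads with different dilation rates (e.g., 1, 2, 3) to capture multi-scale dependencies within the attended receptive field.
A pyramid architecture uses MSDA in the shallow stages and standard MHSA in the deep stages, enabling multi-scale feature extraction at reduced computational cost.
An overlapping tokenizer and overlapping downsamplers handle patch embedding and resolution reduction, and conditional position embedding (CPE), implemented with depthwise convolution, lets the model adapt to different input resolutions.
Three model variants (Tiny, Small, Base) are provided with per-stage configurations, and improvements over prior Vision Transformers are reported on ImageNet-1K, COCO, and ADE20K.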
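
To make the SWDA and MSDA descriptions above concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The function names (`receptive_field`, `swda`, `msda`), the nested-list tensor layout `[H][W][C]`, and the border handling (out-of-range neighbours treated as zero vectors, in the style of convolutional zero padding) are assumptions made for illustration. The final linear projection that would normally follow the concatenated heads is omitted.

```python
import math


def receptive_field(window, dilation):
    """Side length of the area spanned by a dilated window."""
    return dilation * (window - 1) + 1


def swda(q, k, v, window=3, dilation=1):
    """Single-head sliding window dilated attention (illustrative sketch).

    q, k, v are nested lists shaped [H][W][d]. Neighbours outside the grid
    are treated as zero vectors (assumption: convolution-style zero padding).
    """
    H, W, d = len(q), len(q[0]), len(q[0][0])
    half = window // 2
    scale = 1.0 / math.sqrt(d)
    zero = [0.0] * d
    out = [[None] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            scores, values = [], []
            # Visit window x window keys spaced `dilation` positions apart.
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    y, x = i + dilation * di, j + dilation * dj
                    inside = 0 <= y < H and 0 <= x < W
                    key = k[y][x] if inside else zero
                    values.append(v[y][x] if inside else zero)
                    scores.append(scale * sum(a * b for a, b in zip(q[i][j], key)))
            # Numerically stable softmax over the sparse neighbourhood.
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            z = sum(weights)
            out[i][j] = [
                sum(w * val[c] for w, val in zip(weights, values)) / z
                for c in range(d)
            ]
    return out


def msda(q, k, v, dilations=(1, 2, 3), window=3):
    """Split channels into len(dilations) heads, one dilation rate per head."""
    H, W, C = len(q), len(q[0]), len(q[0][0])
    n = len(dilations)
    assert C % n == 0, "channels must divide evenly across heads"
    d = C // n

    def head_slice(t, h):
        # Channels [h*d, (h+1)*d) belong to head h.
        return [[pix[h * d:(h + 1) * d] for pix in row] for row in t]

    heads = [
        swda(head_slice(q, h), head_slice(k, h), head_slice(v, h), window, r)
        for h, r in enumerate(dilations)
    ]
    # Concatenate the per-head outputs along the channel axis.
    return [
        [sum((heads[h][i][j] for h in range(n)), []) for j in range(W)]
        for i in range(H)
    ]
```

With a 3×3 window, `receptive_field(3, r)` gives side lengths 3, 5, and 7 for dilation rates 1, 2, and 3. The heads therefore cover different spatial extents while each query still attends to only nine keys, so no extra parameters are introduced by the multi-scale split.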

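Experimental Results

Research Questions

RQ1: Compared with global self-attention, does Sliding Window Dilated Attention (SWDA) reduce computational cost while maintaining or improving performance?
RQ2: Can Multi-Scale Dilated Attention (MSDA) effectively capture multi-scale context within a single block without adding parameters or cost?
RQ3: How does the pyramid DilateFormer backbone perform relative to state-of-the-art methods on ImageNet-1K classification, COCO object detection/segmentation, and ADE20K semantic segmentation?
RQ4: What are the trade-offs of using MSDA in the shallow layers and MHSA in the deep layers for vision tasks?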

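Key Findings

DilateFormer reaches near or state-of-the-art accuracy on ImageNet-1K with markedly fewer FLOPs (e.g., Dilate-S reaches 83.3% top-1 at 4.8 GFLOPs; Dilate-B at 10.0 GFLOPs reaches 84.4% to 85.6% top-1 under different settings).
With Token Labeling, Dilate-S⋆ and Dilate-B⋆ reach 83.9% and 84.9% top-1 accuracy on ImageNet-1K, respectively, surpassing several LV-ViT variants at comparable cost.
On COCO object detection/instance segmentation, Dilate-B reaches 53.5% box mAP and 46.1% mask mAP in the standard configuration and 49.9/43.7 under an alternative schedule; on ADE20K, Dilate-B reaches 51.1% MS mIoU.
DilateFormer uses 70% fewer FLOPs than some SOTA Vision Transformers while maintaining comparable or better performance, supporting the efficiency of MSDA and the design guided by locality and sparsity.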
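
As a rough, back-of-the-envelope illustration of why windowed attention is cheaper than global attention, the sketch below counts only the multiply-adds of the two attention matrix products (query-key scores and the weighted sum of values). It ignores the Q/K/V projections, the MLPs, and everything else in the model, so it does not reproduce the whole-model 70% figure reported above. The helper names, the 56×56 grid (a 224×224 input at stride 4), and the channel width of 64 are hypothetical values chosen for the example.

```python
def global_attention_macs(n_tokens, dim):
    # Scores (Q K^T) and aggregation (A V) each cost n_tokens^2 * dim multiply-adds.
    return 2 * n_tokens * n_tokens * dim


def swda_macs(n_tokens, dim, window=3):
    # Each query scores and aggregates only window * window keys.
    # The dilation rate changes which keys are picked, not how many.
    return 2 * n_tokens * window * window * dim


n = 56 * 56   # assumed token count on a shallow-stage feature map
dim = 64      # hypothetical channel width
ratio = global_attention_macs(n, dim) / swda_macs(n, dim)
print(f"global / windowed attention cost ratio: {ratio:.1f}")
```

Under these assumptions the ratio equals the token count divided by the window area (3136 / 9), which is why the saving is largest in the high-resolution shallow stages where the summary says MSDA is used.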

This review was generated by AI and checked by human editors.