QUICK REVIEW

[Paper Review] ResT: An Efficient Transformer for Visual Recognition

Qinglong Zhang, Yu-Bin Yang|arXiv (Cornell University)|May 28, 2021

Advanced Neural Network Applications | References: 36 | Cited by: 148

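One-Sentence Summary

ResT is a memory-efficient, multi-scale Vision Transformer backbone that combines EMSA attention, flexible spatial position encoding, and overlapping patch embedding, and it achieves strong results on ImageNet and COCO.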

ABSTRACT

This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to tackle raw images with a fixed resolution, our ResT has several advantages: (1) A memory-efficient multi-head self-attention is built, which compresses the memory by a simple depth-wise convolution, and projects the interaction across the attention-heads dimension while keeping the diversity ability of multi-heads; (2) Position encoding is constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) Instead of the straightforward tokenization at the beginning of each stage, we design the patch embedding as a stack of overlapping convolution operations with stride on the 2D-reshaped token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.

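Motivation and Objectives

Develop a general-purpose backbone architecture for image recognition that combines the locality of CNNs with the global reasoning of Transformers.
Reduce the memory and compute cost of self-attention while preserving the diversity of the attention heads.
Support flexible input sizes and multi-scale feature maps for dense prediction tasks.
Validate ResT on ImageNet-1k classification and on downstream tasks such as object detection and instance segmentation.
Show that ResT outperforms comparable backbones at similar model sizes.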

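Proposed Method

Introduce efficient multi-head self-attention (EMSA), which uses a depth-wise convolution to compress the spatial tokens and projects interactions across the attention heads.
Replace fixed patch tokenization with convolution-based overlapping patch embedding to build a multi-scale feature pyramid.
Define position encoding as spatial attention (PA) so that variable input sizes can be handled without interpolation or fine-tuning.
Embed a 1×1 convolution plus Instance Normalization inside EMSA to restore head diversity and stabilize training.
Use stage-wise patch embedding that progressively increases the channel dimension and reduces the spatial resolution, forming a ResNet-like backbone.
Adopt pre-normalization in downstream frameworks, and use a simple global-average-pooling classifier for ImageNet-1k evaluation.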
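
The effect of overlapping patch embedding and of EMSA's key/value compression on tensor sizes can be sketched with plain arithmetic. This is a minimal illustration, not the authors' implementation: the stem stride of 4, the 3×3 stride-2 padding-1 patch-embedding convolution, the per-stage reduction factors (8, 4, 2, 1), and all names (`conv_out`, `StageShape`, `rest_like_pyramid`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output side length of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


@dataclass
class StageShape:
    side: int        # height == width of the 2D token map
    tokens: int      # n = side * side query tokens
    kv_tokens: int   # key/value tokens left after spatial reduction
    full_attn: int   # entries per head in a standard n x n attention map
    emsa_attn: int   # entries per head in the compressed n x kv_tokens map


def rest_like_pyramid(image_side: int = 224,
                      reductions: Sequence[int] = (8, 4, 2, 1)) -> List[StageShape]:
    # ASSUMPTION: the stem downsamples the image by a factor of 4 overall.
    side = image_side // 4
    stages = []
    for index, s in enumerate(reductions):
        if index > 0:
            # ASSUMPTION: each later stage starts with an overlapping
            # 3x3, stride-2, padding-1 convolution on the token map.
            side = conv_out(side, kernel=3, stride=2, padding=1)
        tokens = side * side
        # ASSUMPTION: the depth-wise reduction shrinks each side by s
        # (ceiling division), so keys and values see fewer tokens.
        kv_side = -(-side // s)
        kv_tokens = kv_side * kv_side
        stages.append(StageShape(side, tokens, kv_tokens,
                                 tokens * tokens, tokens * kv_tokens))
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(rest_like_pyramid(224), start=1):
        saving = stage.full_attn / stage.emsa_attn
        print(f"stage {number}: {stage.side}x{stage.side} tokens, "
              f"attention map {stage.tokens}x{stage.kv_tokens} "
              f"({saving:.0f}x smaller than full attention)")
```

Under these assumptions a 224×224 input yields token maps of side 56, 28, 14, and 7, and the first-stage attention map shrinks from 3136 × 3136 to 3136 × 49 entries per head, a 64-fold reduction. Because every size is derived from the input side rather than fixed in advance, the same function works for other resolutions.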

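Experimental Results

Research Questions

RQ1: How can self-attention in Vision Transformers be made more memory-friendly without degrading performance?
RQ2: Can spatially conditioned position encoding enable flexible input sizes and multi-scale representations for dense prediction?
RQ3: Compared with standard tokenization, does overlapping patch embedding improve low-level feature capture and overall accuracy?
RQ4: What performance gains does the ResT backbone deliver on ImageNet-1k and on COCO object detection and instance segmentation, relative to backbones of similar cost?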

Key Findings

Model	Params (M)	FLOPs (G)	Throughput (images/s)	Top-1 (%)	Top-5 (%)
ResT-Lite	10.49	1.4	1246	77.2 (↑7.5)	93.7 (↑4.6)
ResT-Small	13.66	1.9	1043	79.6 (↑9.9)	94.9 (↑5.8)
ResT-Base	30.28	4.3	673	81.6 (↑2.6)	95.7 (↑1.3)
ResT-Large	51.63	7.9	429	83.6 (↑3.3)	96.3 (↑1.1)
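
The parenthesized ↑ values in the table do not name the baseline they are measured against. As a small worked check, the sketch below recovers the baseline accuracies implied by each row, assuming the ↑ values are absolute percentage-point gains. The names `ROWS` and `implied_baselines` are hypothetical helpers, and the numbers are copied from the table above.

```python
# Rows copied from the table: (model, top1, top1_gain, top5, top5_gain).
ROWS = [
    ("ResT-Lite", 77.2, 7.5, 93.7, 4.6),
    ("ResT-Small", 79.6, 9.9, 94.9, 5.8),
    ("ResT-Base", 81.6, 2.6, 95.7, 1.3),
    ("ResT-Large", 83.6, 3.3, 96.3, 1.1),
]


def implied_baselines(rows):
    """Return {model: (baseline_top1, baseline_top5)}.

    ASSUMPTION: each gain is an absolute difference in percentage points,
    so baseline = reported accuracy - gain. Results are rounded to one
    decimal place to match the precision of the table.
    """
    return {
        model: (round(top1 - gain1, 1), round(top5 - gain5, 1))
        for model, top1, gain1, top5, gain5 in rows
    }


if __name__ == "__main__":
    for model, (base1, base5) in implied_baselines(ROWS).items():
        print(f"{model}: implied baseline Top-1 {base1}, Top-5 {base5}")
```

Under that assumption, ResT-Lite and ResT-Small resolve to the same implied baseline (69.7 Top-1, 89.1 Top-5), which suggests both are compared against one smaller reference model, while ResT-Base and ResT-Large are each compared against a different, stronger one.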

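ResT-Small reaches 79.6% Top-1 accuracy on ImageNet-1k with 1.9G FLOPs and 13.66M parameters.
ResT-Large reaches 83.6% Top-1 accuracy with 7.9G FLOPs and 51.63M parameters, outperforming Swin variants of comparable cost.
On COCO object detection with RetinaNet, ResT-Small improves AP by 3.6 points over PVT-T (40.3 vs. 36.7).
On COCO object detection with RetinaNet, ResT-Base improves AP by 1.6 points over PVT-S (42.0 vs. 40.4).
With Mask R-CNN for instance segmentation, ResT-Large achieves notable gains (APbox 41.6, APmask 38.7) over PVT-S and Swin variants of comparable cost.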
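
The RetinaNet gains quoted above follow directly from the paired AP values. The short check below reproduces them; `RETINANET_AP` and `ap_gain` are hypothetical names, and only the four AP values stated in the findings are used.

```python
# AP values as quoted in the findings above (COCO, RetinaNet).
RETINANET_AP = {
    "ResT-Small": 40.3,
    "PVT-T": 36.7,
    "ResT-Base": 42.0,
    "PVT-S": 40.4,
}


def ap_gain(model: str, baseline: str, table=RETINANET_AP) -> float:
    """Absolute AP difference in points, rounded to one decimal place."""
    return round(table[model] - table[baseline], 1)


if __name__ == "__main__":
    print("ResT-Small vs PVT-T:", ap_gain("ResT-Small", "PVT-T"))
    print("ResT-Base vs PVT-S:", ap_gain("ResT-Base", "PVT-S"))
```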

This review was generated by AI and checked by a human editor.