QUICK REVIEW

[Paper Review] Advancing Vision Transformers with Group-Mix Attention

Chongjian Ge, Xiaohan Ding | arXiv (Cornell University) | Nov 26, 2023

Advanced Neural Network Applications | Cited by 12

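One-Sentence Summary

Introduces Group-Mix Attention (GMA) to capture token-to-token, token-to-group, and group-to-group relations in ViTs, yielding the GroupMixFormer backbone, which achieves state-of-the-art results on ImageNet, COCO, and ADE20K with fewer parameters.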

ABSTRACT

Vision Transformers (ViTs) have been shown to enhance visual recognition through modeling long-range dependencies with multi-head self-attention (MHSA), which is typically formulated as Query-Key-Value computation. However, the attention map generated from the Query and Key captures only token-to-token correlations at one single granularity. In this paper, we argue that self-attention should have a more comprehensive mechanism to capture correlations among tokens and groups (i.e., multiple adjacent tokens) for higher representational capacity. Thereby, we propose Group-Mix Attention (GMA) as an advanced replacement for traditional self-attention, which can simultaneously capture token-to-token, token-to-group, and group-to-group correlations with various group sizes. To this end, GMA splits the Query, Key, and Value into segments uniformly and performs different group aggregations to generate group proxies. The attention map is computed based on the mixtures of tokens and group proxies and used to re-combine the tokens and groups in Value. Based on GMA, we introduce a powerful backbone, namely GroupMixFormer, which achieves state-of-the-art performance in image classification, object detection, and semantic segmentation with fewer parameters than existing models. For instance, GroupMixFormer-L (with 70.3M parameters and 384^2 input) attains 86.2% Top-1 accuracy on ImageNet-1K without external data, while GroupMixFormer-B (with 45.8M parameters) attains 51.2% mIoU on ADE20K.

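Research Motivation and Goals

Motivate and address the limitation that vanilla Q-K-V self-attention in ViTs models token-to-token correlations at only a single granularity.
Propose Group-Mix Attention (GMA), which models token-to-token, token-to-group, and group-to-group correlations at multiple group sizes.
Develop GroupMixFormer, a hierarchical vision transformer backbone built on GMA, for classification, detection, and segmentation.
Demonstrate that GMA improves performance on standard benchmarks with a comparable or smaller parameter count.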

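Proposed Method

Split Q, K, and V into multiple segments and generate group proxies through aggregators with different kernel sizes.
Compute attention on the mixture of original tokens and group proxies to capture multi-granularity correlations.
Use a token ensemble layer to fuse the outputs of the attention and non-attention branches.
Use depthwise convolutions as aggregators, with an optional identity mapping that preserves token-level correlations.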
Provide five GroupMixFormer configurations (M, T, S, B, L), each a four-stage hierarchical backbone.
Train and evaluate classification on ImageNet-1K; detection/segmentation on COCO with Mask R-CNN and RetinaNet; and segmentation on ADE20K with UperNet and Semantic FPN.
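
The splitting and aggregation steps of the method above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: tokens are treated as a 1D sequence rather than a 2D feature map, each aggregator is a fixed sliding-window mean standing in for a learned depthwise convolution, and the non-attention branch and token ensemble layer are omitted. The function names (`sliding_mean`, `group_proxies`, `group_mix_attention`) and the kernel sizes `(1, 3, 5, 7)` are hypothetical choices made for this example.

```python
import numpy as np

def sliding_mean(x, k):
    """Average each token with its neighbours in a window of odd size k, per channel.

    Stand-in (assumption) for a learned depthwise convolution aggregator.
    x has shape (num_tokens, channels).
    """
    if k == 1:
        return x  # identity mapping: keeps plain token-level features
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + k].mean(axis=0) for i in range(x.shape[0])])

def group_proxies(x, kernel_sizes):
    """Split channels into equal segments and aggregate each with its own window size."""
    segments = np.array_split(x, len(kernel_sizes), axis=1)
    mixed = [sliding_mean(s, k) for s, k in zip(segments, kernel_sizes)]
    return np.concatenate(mixed, axis=1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def group_mix_attention(q, k, v, kernel_sizes=(1, 3, 5, 7)):
    """Scaled dot-product attention computed on mixtures of tokens and group proxies."""
    qg = group_proxies(q, kernel_sizes)
    kg = group_proxies(k, kernel_sizes)
    vg = group_proxies(v, kernel_sizes)
    attn = softmax(qg @ kg.T / np.sqrt(q.shape[1]))
    return attn @ vg, attn

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))  # 16 tokens, 8 channels
out, attn = group_mix_attention(tokens, tokens, tokens)
print(out.shape, attn.shape)
```

With `kernel_sizes=(1,)` every segment passes through the identity mapping, so the sketch reduces to ordinary single-head self-attention; larger window sizes are what introduce the token-to-group and group-to-group terms.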

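Experimental Results

Research Questions

RQ1: Can Group-Mix Attention model correlations between tokens and groups at multiple granularities within each Transformer encoder layer?
RQ2: Does introducing token-to-group and group-to-group interactions improve visual representations for classification, detection, and segmentation relative to conventional self-attention?
RQ3: How does the GroupMixFormer backbone compare with state-of-the-art ViTs and CNNs in accuracy and efficiency on ImageNet, COCO, and ADE20K?
RQ4: How do different aggregators (kernel sizes) and architectural configurations affect performance?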

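Key Findings

GroupMixFormer achieves state-of-the-art or competitive accuracy on ImageNet-1K classification, COCO object detection/segmentation, and ADE20K semantic segmentation.
Smaller GroupMixFormer variants match the ImageNet performance of larger models, while larger variants reach higher accuracy at higher resolution (for example, GroupMixFormer-L attains 86.2% Top-1 with 384^2 input).
In ablations, the aggregators are essential: adding group-based aggregation to multiple pre-attention branches improves Top-1 accuracy as well as detection/segmentation metrics.
The Group-Mix mechanism also benefits other ViT architectures, indicating that it applies beyond GroupMixFormer.
An efficient design based on sliding-window aggregation makes multi-granularity modeling feasible under practical compute constraints.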

This review was generated by AI and reviewed by human editors.