QUICK REVIEW

[Paper Review] Focal Self-attention for Local-Global Interactions in Vision Transformers

Jianwei Yang, Chunyuan Li | arXiv (Cornell University) | Jul 1, 2021

Visual Attention and Saliency Detection | References: 87 | Cited by: 267

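One-sentence summary

This paper proposes focal self-attention, which combines fine-grained local interactions with coarse-grained global interactions in Vision Transformers, and reports state-of-the-art results on ImageNet, COCO, and ADE20K across several model sizes.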

ABSTRACT

Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability of capturing short- and long-range visual dependencies through self-attention is arguably the main source for the success. But it also brings challenges due to quadratic computational overhead, especially for the high-resolution vision tasks (e.g., object detection). In this paper, we present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions. Using this new mechanism, each token attends the closest surrounding tokens at fine granularity but the tokens far away at coarse granularity, and thus can capture both short- and long-range visual dependencies efficiently and effectively. With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers on a range of public image classification and object detection benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M and a larger size of 89.8M achieve 83.5 and 83.8 Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state-of-the-art Swin Transformers for 6 different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation, creating new SoTA on three of the most challenging computer vision tasks.

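Motivation and objectives

Address the quadratic computational cost of full self-attention in high-resolution vision tasks such as object detection and segmentation.
Propose focal self-attention to model fine-grained local and coarse-grained global interactions efficiently.
Develop Focal Transformer variants with a multi-scale structure for accurate dense prediction.
Empirically validate the improvements over SoTA Transformers on classification, detection, and segmentation tasks.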

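Proposed method

Define focal self-attention, in which each token attends to nearby tokens at fine granularity and to distant tokens at coarse granularity.
Implement window-based focal self-attention by partitioning the feature map into windows and pooling sub-windows to obtain multiple focal levels.
Compute queries, keys, and values with linear projections, and apply multi-level attention with a relative position bias.
Adopt a multi-stage, multi-scale architecture that combines patch embeddings with per-stage focal blocks to handle high-resolution inputs.
Train and evaluate Focal Transformer variants (Focal-Tiny, Focal-Small, Focal-Base) on ImageNet-1K, COCO, and ADE20K, comparing against Swin Transformers and other baselines.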
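The window-and-pooling idea above can be illustrated with a small NumPy sketch. This is a simplified toy under stated assumptions, not the authors' implementation: it uses a single head, two focal levels, plain average pooling for the coarse level, random projection matrices in place of learned ones, and no relative position bias. The function names and parameters (`avg_pool`, `focal_attention_window`, `win`, `fine_region`, `pool`) are hypothetical and chosen only for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool(feat, s):
    # Average-pool an (H, W, C) map over non-overlapping s x s sub-windows.
    # Assumption: the coarse level is summarized by average pooling.
    H, W, C = feat.shape
    return feat.reshape(H // s, s, W // s, s, C).mean(axis=(1, 3))

def focal_attention_window(feat, top, left, win=4, fine_region=8, pool=4, seed=0):
    """Toy two-level focal attention for one query window.

    Queries: the win x win tokens whose top-left corner is (top, left).
    Fine level: un-pooled tokens in a fine_region x fine_region area
                centred on the window (clipped at the map border).
    Coarse level: the whole map, average-pooled by a factor of `pool`.
    """
    H, W, C = feat.shape
    rng = np.random.default_rng(seed)
    # Random stand-ins for the learned query/key/value projections.
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

    q = feat[top:top + win, left:left + win].reshape(-1, C) @ Wq

    pad = (fine_region - win) // 2
    t0, l0 = max(top - pad, 0), max(left - pad, 0)
    t1, l1 = min(top + win + pad, H), min(left + win + pad, W)
    fine = feat[t0:t1, l0:l1].reshape(-1, C)

    coarse = avg_pool(feat, pool).reshape(-1, C)

    # Keys and values come from the concatenation of both levels.
    ctx = np.concatenate([fine, coarse], axis=0)
    k, v = ctx @ Wk, ctx @ Wv
    attn = softmax(q @ k.T / np.sqrt(C))
    return attn @ v, attn

feat = np.random.default_rng(1).standard_normal((16, 16, 8))
out, attn = focal_attention_window(feat, top=4, left=4)
print(out.shape, attn.shape)
```

In this toy setting each of the 16 query tokens attends to 64 fine tokens plus 16 pooled tokens, i.e. 80 keys, whereas full attention over the same 16 x 16 map would use 256 keys per query.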

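Experimental results

Research questions

RQ1: Can focal self-attention in Vision Transformers capture both local and global interactions without incurring quadratic computational cost?
RQ2: Compared with existing attention strategies, does a multi-scale, window-based focal mechanism improve performance on image classification, object detection, and semantic segmentation?
RQ3: How do Focal Transformer variants perform against state-of-the-art models on standard benchmarks?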

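Key findings

Focal Transformers outperform SoTA Vision Transformer baselines of similar size and FLOPs on ImageNet-1K classification.
Focal-Small and Focal-Base reach higher Top-1 accuracy than comparable Swin and other Transformer models (the abstract reports 83.5 and 83.8 for the 51.1M and 89.8M models at 224x224 resolution).
On COCO object detection and instance segmentation, Focal-Tiny/Small/Base deliver consistent gains over Swin Transformer across multiple detectors and training schedules.
On ADE20K semantic segmentation, Focal-Tiny/Small/Base outperform Swin Transformers of the same size in both single-scale and multi-scale settings.
The proposed attention mechanism captures short-range fine-grained and long-range coarse-grained interactions at a lower computational cost than full attention.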
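To make the cost claim concrete, the following back-of-the-envelope sketch counts how many keys a single query attends to under full attention versus a two-level focal scheme. The level settings (an 8 x 8 fine region and a whole-map coarse level pooled by 8) are illustrative assumptions, not the configuration used in the paper, and the helper names are hypothetical.

```python
def keys_per_query_full(h, w):
    # Full self-attention: every query attends to every token.
    return h * w

def keys_per_query_focal(levels):
    """Count keys per query for a list of focal levels.

    Each level is (region_h, region_w, pool): the attended region in
    original tokens and the side of the pooling sub-window (1 = no pooling).
    """
    total = 0
    for region_h, region_w, pool in levels:
        total += (region_h // pool) * (region_w // pool)
    return total

h = w = 64
full = keys_per_query_full(h, w)
# Fine level: 8 x 8 un-pooled tokens; coarse level: whole map pooled by 8.
focal = keys_per_query_focal([(8, 8, 1), (h, w, 8)])
print(full, focal, full / focal)  # 4096 128 32.0
```

Under these assumed settings the focal scheme attends to 128 keys per query instead of 4096, and the gap widens as the feature map grows because the fine region stays fixed while only the pooled coarse level scales with resolution.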

This review was generated by AI and reviewed by human editors.