QUICK REVIEW

[Paper Review] LocalViT: Analyzing Locality in Vision Transformers

Yawei Li, Kai Zhang|arXiv (Cornell University)|Apr 12, 2021

Advanced Neural Network Applications | References: 44 | Cited by: 283

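One-Sentence Summary

LocalViT adds a locality mechanism to vision transformers by inserting depth-wise convolution into the feed-forward network, improving ImageNet accuracy without a significant increase in cost and generalizing across several transformer architectures.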

ABSTRACT

The aim of this paper is to study the influence of locality mechanisms in vision transformers. Transformers originated from machine translation and are particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between the token embeddings could be well modelled by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. In this paper, locality mechanism is systematically investigated by carefully designed controlled experiments. We add locality to vision transformers into the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) A wide range of design choices (activation function, layer placement, expansion ratio) are available for incorporating locality mechanisms and proper choices can lead to a performance gain over the baseline, and 2) The same locality mechanism is successfully applied to vision transformers with different architecture designs, which shows the generalization of the locality concept. For ImageNet2012 classification, the locality-enhanced transformers outperform the baselines Swin-T, DeiT-T, and PVT-T by 1.0%, 2.6% and 3.1% with a negligible increase in the number of parameters and computational effort. Code is available at https://github.com/ofsoundof/LocalViT.

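Motivation and Objectives

Motivate the integration of locality mechanisms into vision transformers to capture local image structure.
Propose a transformer whose locality is enhanced by inserting depth-wise convolution into the feed-forward network.
Analyze how locality, the activation function, and the expansion ratio affect performance.
Demonstrate the approach on multiple vision transformer architectures to show its generality.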

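Proposed Method

Treat the input as a sequence of token embeddings and rearrange it into a 2D lattice (Seq2Img).
Replace the feed-forward network with a module inspired by inverted residuals, consisting of 1x1 convolutions and a depth-wise 2D convolution.
Apply a non-linear activation (e.g. ReLU6, h-swish) after the depth-wise convolution, with an optional channel-attention module (ECA/SE).
Split off the class token before the feed-forward network and concatenate it back after the image tokens have been processed, so that classification behaviour is preserved.
Apply locality to selected transformer layers and analyze how placement and the expansion ratio (gamma) affect performance.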
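
The steps above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`local_feed_forward`, `seq2img`, `depthwise_conv3x3`, and so on) and the weight arguments are hypothetical, the class token is assumed to be the first token and to pass through unchanged, and the residual connection, normalization, and the optional ECA/SE module are omitted.

```python
# Illustrative sketch of a locality feed-forward pass on plain Python lists.
# All names and weights here are hypothetical, chosen for this example only.

def relu6(x):
    return min(max(x, 0.0), 6.0)

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6
    return x * relu6(x + 3.0) / 6.0

def linear(vec, weight):
    # A 1x1 convolution acts on each token independently: a matrix-vector product.
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def seq2img(tokens, h, w):
    # Rearrange h*w tokens (row-major order) into an h x w grid of channel vectors.
    return [[tokens[i * w + j] for j in range(w)] for i in range(h)]

def img2seq(grid):
    # Flatten the grid back into a row-major token sequence.
    return [vec for row in grid for vec in row]

def depthwise_conv3x3(grid, kernels):
    # kernels[ch] is the 3x3 filter of channel ch; zero padding keeps the grid size.
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for ch in range(c):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        y, x = i + di, j + dj
                        if 0 <= y < h and 0 <= x < w:
                            acc += kernels[ch][di + 1][dj + 1] * grid[y][x][ch]
                out[i][j][ch] = acc
    return out

def local_feed_forward(tokens, h, w, w_expand, dw_kernels, w_project, act=h_swish):
    # Split the class token (assumed to be first) from the image tokens.
    cls_token, image_tokens = tokens[0], tokens[1:]
    # 1x1 expansion followed by the activation.
    hidden = [[act(v) for v in linear(t, w_expand)] for t in image_tokens]
    # Seq2Img, depth-wise 3x3 convolution, activation, then back to a sequence.
    grid = depthwise_conv3x3(seq2img(hidden, h, w), dw_kernels)
    hidden = [[act(v) for v in vec] for vec in img2seq(grid)]
    # 1x1 projection back to the embedding width, then re-attach the class token.
    projected = [linear(t, w_project) for t in hidden]
    return [cls_token] + projected
```

Calling `local_feed_forward` on a 3x3 grid in which only the top-left image token is non-zero leaves the opposite corner at zero: the depth-wise convolution mixes each token only with its immediate spatial neighbours, which is the locality the method adds on top of global self-attention.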

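Experimental Results

Research Questions

RQ1: Does injecting locality into the feed-forward network improve the accuracy of vision transformers without significantly increasing the parameter count or FLOPs?
RQ2: How do the activation function, layer placement, and hidden-dimension expansion affect the gains from locality?
RQ3: How well does the locality mechanism generalize across vision transformer architectures (e.g. DeiT, T2T-ViT, PVT, TNT)?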

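Key Findings

Depth-wise convolution alone already improves the baseline transformer.
The choice of activation function after the depth-wise convolution significantly affects the gain (e.g. h-swish combined with SE/ECA yields a larger gain).
Locality is more beneficial in the lower transformer layers than in the higher ones.
Increasing the hidden-dimension expansion ratio (gamma) adds capacity and improves accuracy.
The locality mechanism generalizes across DeiT, T2T-ViT, PVT, and TNT, with notable improvements over the baseline in several cases.
On ImageNet, the locality-enhanced variants outperform the DeiT-T and PVT-T baselines by 2.6% and 3.1% respectively, as reported in the abstract, with a negligible increase in parameters and computation.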
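
To see why the overhead can be small, a rough weight count helps. The sketch below is a back-of-the-envelope estimate, not a figure from the paper: it ignores biases, normalization layers, and any ECA/SE module, and it assumes an embedding width of 192 (the DeiT-T width), an expansion ratio gamma of 4, and a 3x3 depth-wise kernel. The function names are hypothetical.

```python
def ffn_weight_count(dim, gamma):
    # Two 1x1 convolutions (dim -> gamma*dim -> dim), biases ignored.
    hidden = gamma * dim
    return dim * hidden + hidden * dim

def depthwise_weight_count(dim, gamma, kernel_size=3):
    # One kernel_size x kernel_size filter per hidden channel, biases ignored.
    return kernel_size * kernel_size * gamma * dim

# Assumed values for illustration: DeiT-T embedding width and expansion ratio 4.
dim, gamma = 192, 4
base = ffn_weight_count(dim, gamma)
extra = depthwise_weight_count(dim, gamma)
print(base, extra, round(100 * extra / base, 2))  # 294912 6912 2.34
```

Under these assumptions the depth-wise filters add 9 * gamma * dim weights against 2 * gamma * dim^2 for the two 1x1 layers, a ratio of 4.5 / dim that does not depend on gamma, so a larger expansion ratio increases capacity without making the locality layer relatively more expensive.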


This review was generated by AI and reviewed by a human editor.