QUICK REVIEW

[Paper Review] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Li Yuan, Yunpeng Chen, et al. | arXiv (Cornell University) | Jan 28, 2021

Multimodal Machine Learning Applications | Cited by 32

One-sentence summary

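This paper proposes T2T-ViT, which uses a Tokens-to-Token module to progressively structurize an image into tokens and pairs it with a deep-narrow backbone. Trained from scratch on ImageNet, it needs fewer parameters and less computation (MACs) than ViT while surpassing both ViT and strong CNN baselines in accuracy.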

ABSTRACT

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3% top1 accuracy in image resolution 384×384 on ImageNet. (Code: https://github.com/yitu-opensource/T2T-ViT)

Motivation and objectives

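Explain why pure Transformer architectures underperform CNNs when trained from scratch on a midsize dataset such as ImageNet.
Propose a Tokens-to-Token (T2T) module that captures local image structure and iteratively reduces the token length.
Design an efficient deep-narrow ViT backbone that improves feature richness and reduces redundancy.
Show that T2T-ViT can outperform CNNs of similar size on ImageNet without large-scale pretraining.
Show how CNN-inspired architectural choices benefit ViT backbones.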

Proposed method

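Introduce a layer-wise Tokens-to-Token (T2T) module that alternates Re-structurization and Soft Split steps, progressively converting the image into tokens that embed local structure.
Use a deep-narrow ViT backbone, with a smaller hidden dimension and more layers, to reduce parameters and MACs while maintaining performance.
Try both Transformer and Performer layers inside the T2T module to keep memory and computation manageable.
Compare T2T-ViT with ViT, ResNets, and MobileNets on ImageNet at comparable model sizes.
Run ablation studies to quantify the contribution of the T2T module and the deep-narrow architecture, and explore transfer to CIFAR-10/100.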
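
The Soft Split step described above can be illustrated with a small sketch. It splits a single-channel grid into overlapping, zero-padded patches and flattens each patch into one token, so neighboring tokens share pixels and the token grid shrinks at every stage. The function names and the kernel/stride/padding schedule (7/4/2, then 3/2/1 twice) are assumptions chosen for illustration; this review does not state them. The Re-structurization step, where tokens pass through a Transformer or Performer layer and are reshaped back into a grid, is omitted.

```python
def soft_split_len(size, kernel, stride, padding):
    """Tokens per side after one soft split (padded sliding-window arithmetic)."""
    return (size + 2 * padding - kernel) // stride + 1


def soft_split(grid, kernel, stride, padding):
    """Split a 2D single-channel grid into overlapping, zero-padded patches.

    Each patch is flattened row by row into one token of length kernel * kernel.
    Returns the token list and the (rows, cols) shape of the new token grid.
    """
    h, w = len(grid), len(grid[0])
    out_h = soft_split_len(h, kernel, stride, padding)
    out_w = soft_split_len(w, kernel, stride, padding)
    tokens = []
    for i in range(out_h):
        for j in range(out_w):
            token = []
            for di in range(kernel):
                for dj in range(kernel):
                    r = i * stride - padding + di
                    c = j * stride - padding + dj
                    inside = 0 <= r < h and 0 <= c < w
                    token.append(grid[r][c] if inside else 0)  # zero padding
            tokens.append(token)
    return tokens, (out_h, out_w)


# Assumed (kernel, stride, padding) per stage; hypothetical values for illustration.
SCHEDULE = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]


def token_grid_sizes(image_size, schedule=SCHEDULE):
    """Side length of the token grid after each soft split stage."""
    sizes = []
    side = image_size
    for kernel, stride, padding in schedule:
        side = soft_split_len(side, kernel, stride, padding)
        sizes.append(side)
    return sizes


print(token_grid_sizes(224))  # [56, 28, 14]

# Tiny 4x4 example: stride 2 with a 3x3 window means adjacent patches overlap.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens, shape = soft_split(grid, kernel=3, stride=2, padding=1)
print(shape, tokens[0])  # (2, 2) [0, 0, 0, 0, 0, 1, 0, 4, 5]
```

Under the assumed schedule, a 224×224 input ends as a 14×14 grid, i.e. 196 tokens for the backbone. Because the stride is smaller than the kernel, each token carries information from its neighbors, which is the local structure that naive non-overlapping tokenization misses.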

Experimental results

Research questions

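RQ1: When training from scratch on ImageNet, does a progressive tokens-to-token module capture local image structure better than the naive tokenization used by ViT?
RQ2: Compared with the standard ViT, does a CNN-inspired deep-narrow backbone reduce redundancy and improve the feature richness of vision transformers?
RQ3: At similar parameter counts and compute budgets, how does T2T-ViT trained from scratch on ImageNet compare with ResNets and MobileNets?
RQ4: How do different T2T module variants (Transformer vs. Performer) affect performance and efficiency?
RQ5: Do pretrained T2T-ViT models transfer effectively to downstream datasets such as CIFAR-10/100?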
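
To make the deep-narrow idea behind RQ2 concrete, here is a rough parameter estimate for a stack of Transformer encoder layers. It counts only the large weight matrices and ignores biases, LayerNorm, the tokenizer, position embeddings, and the classifier head. The two configurations are assumptions for illustration (a ViT-B-like shallow-wide stack and a T2T-ViT-14-like deep-narrow stack); this review does not list depths or hidden sizes.

```python
def encoder_params(depth, hidden, mlp):
    """Approximate weight count of `depth` Transformer encoder layers.

    Per layer: Q, K, V and output projections (4 * hidden^2) plus the two
    MLP matrices (2 * hidden * mlp). Everything else is ignored.
    """
    per_layer = 4 * hidden * hidden + 2 * hidden * mlp
    return depth * per_layer


# Hypothetical configurations, not taken from this review.
shallow_wide = encoder_params(depth=12, hidden=768, mlp=3072)
deep_narrow = encoder_params(depth=14, hidden=384, mlp=1152)

print(f"shallow-wide: {shallow_wide / 1e6:.1f}M")  # shallow-wide: 84.9M
print(f"deep-narrow:  {deep_narrow / 1e6:.1f}M")   # deep-narrow:  20.6M
```

Since the per-layer cost grows with the square of the hidden dimension, halving the width cuts parameters far more than adding a few layers gives back. If the assumed configurations are close to the real ones, these estimates land near the 86.4M and 21.5M reported for ViT-B/16 and T2T-ViT-14 in the results table.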

Key findings

Model	Top-1 accuracy (%)	Params (M)	MACs (G)
ViT-S/16 [12]	78.1	48.6	10.1
DeiT-small [36]	79.9	22.1	4.6
DeiT-small-Distilled [36]	81.2	22.1	4.7
T2T-ViT-14	81.5	21.5	4.8
T2T-ViT-14↑384	83.3	21.5	17.1
ViT-B/16 [12]	79.8	86.4	17.6
ViT-L/16 [12]	81.1	304.3	63.6
T2T-ViT-24	82.3	64.1	13.8
T2T-ViT_t-14	81.7	21.5	6.1
T2T-ViT_t-24	82.6	64.1	15.0

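With 21.5M parameters and 4.8G MACs, T2T-ViT-14 reaches 81.5% top-1 accuracy on ImageNet (224×224) when trained from scratch, outperforming ViT-S/16 (78.1%) and matching or exceeding ResNets of similar size.
At 384×384 input, T2T-ViT-14↑384 reaches 83.3% top-1 accuracy, a 1.8-point gain over the 224×224 model, at the cost of 17.1G MACs instead of 4.8G.
Compared with ResNet50 (25.5M parameters, 4.3G MACs), T2T-ViT-14 reaches 81.5% top-1 accuracy with fewer parameters (21.5M) at slightly higher compute (4.8G MACs; 6.1G MACs for the t-variant), i.e., better accuracy at comparable or somewhat higher compute.
T2T-ViT-24 reaches 82.3% top-1 accuracy with 64.1M parameters and 13.8G MACs, remaining competitive at larger scale.
Lite T2T-ViT models (e.g., T2T-ViT-7 and T2T-ViT-12) are competitive with MobileNets in accuracy, although their MACs are higher; distillation further improves the small models.
Transferring pretrained T2T-ViT to CIFAR-10/100 yields competitive gains over ViT baselines, indicating good transferability.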
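
As a quick check on the "half the parameters and MACs, more than 3.0% better" claim from the abstract, the comparisons can be recomputed from the table above. The sketch below copies four rows from that table; the `compare` helper is a hypothetical name introduced only for this illustration.

```python
# (top-1 %, params in M, MACs in G), copied from the results table.
RESULTS = {
    "ViT-S/16": (78.1, 48.6, 10.1),
    "ViT-B/16": (79.8, 86.4, 17.6),
    "T2T-ViT-14": (81.5, 21.5, 4.8),
    "T2T-ViT-24": (82.3, 64.1, 13.8),
}


def compare(model, baseline, results=RESULTS):
    """Accuracy gain (points) and parameter/MACs ratios of model vs. baseline."""
    acc, params, macs = results[model]
    b_acc, b_params, b_macs = results[baseline]
    return {
        "top1_gain": round(acc - b_acc, 1),
        "params_ratio": round(params / b_params, 2),
        "macs_ratio": round(macs / b_macs, 2),
    }


print(compare("T2T-ViT-14", "ViT-S/16"))
# {'top1_gain': 3.4, 'params_ratio': 0.44, 'macs_ratio': 0.48}
print(compare("T2T-ViT-24", "ViT-B/16"))
# {'top1_gain': 2.5, 'params_ratio': 0.74, 'macs_ratio': 0.78}
```

The first comparison matches the abstract: T2T-ViT-14 uses a bit under half the parameters and MACs of ViT-S/16 and gains 3.4 points. The larger pair shows a smaller but still consistent advantage.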

This review was generated by AI and reviewed by a human editor.