QUICK REVIEW

[Paper Review] Scalable Visual Transformers with Hierarchical Pooling

Zizheng Pan, Bohan Zhuang|arXiv (Cornell University)|Mar 19, 2021

Advanced Neural Network Applications|References: 47|Cited by: 28

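One-Sentence Summary

This paper proposes the Hierarchical Visual Transformer (HVT), a scalable vision transformer architecture that progressively pools visual tokens to shrink the sequence length and reduce computational cost, analogous to feature-map downsampling in convolutional neural networks (CNNs). Because the shorter sequence frees up compute, model capacity can be increased without adding floating-point operations (FLOPs); with comparable FLOPs, HVT outperforms competitive baselines on ImageNet and CIFAR-100.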

ABSTRACT

The recently proposed Visual image Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks, such as image classification. However, the routine of the current ViT model is to maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length and hence reduces the computational cost, analogous to the feature maps downsampling in Convolutional Neural Networks (CNNs). It brings a great benefit that we can increase the model capacity by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity due to the reduced sequence length. Moreover, we empirically find that the average pooled visual tokens contain more discriminative information than the single class token. To demonstrate the improved scalability of our HVT, we conduct extensive experiments on the image classification task. With comparable FLOPs, our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.

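Motivation and Objectives

Address the computational redundancy caused by standard ViT models keeping a full-length patch sequence throughout inference.
Enable scalable model growth in depth, width, resolution, and patch size without added computational complexity.
Improve feature representation by replacing or supplementing the single class token with more discriminative pooled visual tokens.
Show that hierarchical pooling in vision transformers yields better performance at FLOPs comparable to existing methods.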

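Proposed Method

Introduce hierarchical pooling layers that progressively shorten the visual token sequence, analogous to downsampling in CNNs.
Apply average pooling along the spatial dimension to compress the sequence, lowering computational cost while retaining discriminative features.
Replace or supplement the class token with pooled visual tokens; empirically, these pooled tokens carry more discriminative information than the single class token.
Design a hierarchical architecture in which multiple pooling stages at different resolutions build multi-scale representations.
Maintain high model capacity by increasing depth, width, and resolution, while the reduced sequence length keeps FLOPs from growing.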
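
The pooling steps above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names `pool_tokens` and `mean_token`, the non-overlapping window of size 2, and the toy 8-token sequence are all assumptions chosen for illustration, and a real model would interleave Transformer blocks between the pooling stages.

```python
from typing import List

Token = List[float]  # one visual token embedding (hypothetical toy type)


def pool_tokens(tokens: List[Token], kernel: int = 2) -> List[Token]:
    """Average each non-overlapping window of `kernel` consecutive tokens.

    The kernel size is an illustrative assumption, not a value from the paper.
    """
    if kernel < 1:
        raise ValueError("kernel must be >= 1")
    pooled = []
    for start in range(0, len(tokens), kernel):
        window = tokens[start:start + kernel]
        dim = len(window[0])
        # Channel-wise mean over the tokens in this window.
        pooled.append([sum(t[i] for t in window) / len(window) for i in range(dim)])
    return pooled


def mean_token(tokens: List[Token]) -> Token:
    """Average all tokens into one vector, usable in place of a class token."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


# Toy sequence: 8 tokens with 2 channels each.
seq = [[float(i), float(2 * i)] for i in range(8)]
stage1 = pool_tokens(seq)      # 8 -> 4 tokens
stage2 = pool_tokens(stage1)   # 4 -> 2 tokens
summary = mean_token(stage2)   # single vector that a classifier head could consume
print(len(seq), len(stage1), len(stage2), summary)
```

Each stage halves the sequence length, so every Transformer block placed after a pooling stage attends over fewer tokens.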

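Experimental Results

Research Questions

RQ1: Can hierarchical pooling in vision transformers reduce sequence length and computational cost without sacrificing performance?
RQ2: Does replacing or supplementing the class token with pooled visual tokens improve the discriminative power of the features?
RQ3: Can model capacity be scaled in depth, width, resolution, and patch size without increasing FLOPs?
RQ4: How does the proposed HVT compare with state-of-the-art ViT and CNN baselines in accuracy and FLOPs on standard benchmarks?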

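Key Findings

With comparable FLOPs, HVT outperforms competitive baselines on ImageNet, showing better scalability and accuracy.
HVT also outperforms the baselines on CIFAR-100 at comparable FLOPs, confirming its effectiveness on smaller-scale datasets.
Empirical results show that average-pooled visual tokens contain more discriminative information than the single class token, supporting this design choice.
By reducing sequence length, the hierarchical pooling mechanism lets the model scale across dimensions without added computational cost.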
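
To make the last finding concrete, the following sketch estimates how block cost depends on sequence length. It assumes the commonly used count of multiply-accumulate operations for one Transformer block with an MLP expansion ratio of 4, namely 12·n·d² + 2·n²·d for n tokens of width d. The function names and both stage configurations are hypothetical and are not taken from the paper's tables.

```python
def block_flops(n: int, d: int) -> int:
    """Approximate multiply-accumulates of one Transformer block.

    Attention: 4*n*d^2 (Q, K, V and output projections)
               + 2*n^2*d (attention scores and weighted sum).
    MLP with hidden size 4*d (assumed ratio): 8*n*d^2.
    """
    return 12 * n * d * d + 2 * n * n * d


def model_flops(stages):
    """Sum block costs over (num_blocks, seq_len, dim) stages."""
    return sum(blocks * block_flops(n, d) for blocks, n, d in stages)


# Hypothetical configurations for illustration only.
flat = model_flops([(12, 196, 384)])                              # full-length sequence throughout
hier = model_flops([(4, 196, 384), (4, 98, 384), (4, 49, 384)])   # sequence halved after each stage
print(flat, hier, round(hier / flat, 3))
```

Under these assumptions the hierarchical layout costs noticeably less than the flat one at the same depth and width, and that saved budget is what could be spent on extra depth, width, or input resolution.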


This review was generated by AI and reviewed by a human editor.