QUICK REVIEW

[Paper Review] Multiscale Vision Transformers

Haoqi Fan, Bo Xiong|arXiv (Cornell University)|Apr 22, 2021

Advanced Neural Network Applications | References: 109 | Cited by: 55

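One-Sentence Summary

Multiscale Vision Transformers (MViT) combine multiscale feature hierarchies with transformers for video and image recognition, achieving higher accuracy at lower computational cost than concurrent ViT models that rely on large-scale pre-training.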

ABSTRACT

We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast

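Motivation and Objectives

Motivate the use of multiscale feature hierarchies in transformer models to exploit the dense structure of visual signals.
Propose a multiscale transformer architecture that progressively increases channel capacity while reducing spatiotemporal resolution.
Evaluate MViT on video recognition benchmarks (Kinetics, Charades, SSv2, AVA) and on image classification (ImageNet), without external pre-training.
Compare MViT with concurrent vision transformers in terms of accuracy, computation (FLOPs), and parameter efficiency.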

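Proposed Method

Introduce Multi Head Pooling Attention (MHPA), which pools the Q, K, and V sequences to allow flexible spatiotemporal resolution within a transformer block.
Define a pooling operator P, parameterized by kernel size, stride, and padding, that shortens the sequence and thereby speeds up the attention computation.
Organize the network into scale stages, each of which downsamples the spatiotemporal resolution while expanding channel capacity.
Use skip connections that align dimensions through pooling and linear layers to accommodate changes in resolution and channel width.
Instantiate concrete MViT variants (e.g., MViT-B, MViT-S) with specific stage configurations, resolutions, and channel growth to balance accuracy and efficiency.
Train from scratch on Kinetics without ImageNet pre-training; report inference FLOPs, memory, and accuracy; and compare against ViT baselines and other video models.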
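
The effect of the pooling operator P and of the scale stages on tensor sizes can be sketched with plain arithmetic. This is an illustrative sketch, not the authors' implementation: the function names (`pooled_size`, `stage_schedule`, `attention_pairs`), the 8x56x56 token grid, the 96 starting channels, the kernel, stride, and padding values, and the channel-doubling rule are all assumptions chosen for the example. The per-axis output length uses the standard pooling formula floor((L + 2p - k) / s) + 1.

```python
from math import prod


def pooled_size(size, kernel, stride, padding):
    """Per-axis output length of a pooling operator: (L + 2p - k) // s + 1."""
    return tuple(
        (length + 2 * p - k) // s + 1
        for length, k, s, p in zip(size, kernel, stride, padding)
    )


def attention_pairs(q_size, kv_size):
    """Number of query-key pairs the attention matrix has to cover."""
    return prod(q_size) * prod(kv_size)


def stage_schedule(size, channels, num_stages,
                   kernel=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1)):
    """List (token grid, channels) per stage.

    Assumption for illustration: every stage transition pools the grid
    with the given stride and doubles the channel dimension.
    """
    stages = []
    for _ in range(num_stages):
        stages.append((size, channels))
        size = pooled_size(size, kernel, stride, padding)
        channels *= 2
    return stages


# Hypothetical starting point: an 8x56x56 (T x H x W) token grid, 96 channels.
stages = stage_schedule((8, 56, 56), 96, num_stages=4)
for grid, width in stages:
    print(grid, width)

# Pooling only K and V (hypothetical stride 1x8x8) shortens the key/value
# sequence, so the attention matrix shrinks by the same factor.
q_grid = (8, 56, 56)
kv_grid = pooled_size(q_grid, (3, 3, 3), (1, 8, 8), (1, 1, 1))
saving = attention_pairs(q_grid, q_grid) // attention_pairs(q_grid, kv_grid)
print(kv_grid, saving)
```

Under these assumed settings the grid shrinks from 8x56x56 to 8x7x7 over four stages while the channel width grows from 96 to 768, and pooling the keys and values by 8 along each spatial axis cuts the number of query-key pairs by a factor of 64.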

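Experimental Results

Research Questions

RQ1: How does introducing a multiscale feature hierarchy through MHPA affect accuracy and efficiency on video recognition tasks?
RQ2: Without large-scale external pre-training, can MViT match or exceed the performance of concurrent vision transformers?
RQ3: Does the temporal bias of video transformers change when multiscale spatiotemporal kernels such as those in MHPA are used?
RQ4: How well does the multiscale design transfer to image classification, where there is no temporal component?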

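Key Findings

Without external pre-training data, MViT achieves significant performance gains over concurrent video transformers.
At comparable or higher accuracy, MViT requires less computation and fewer parameters than several ViT-based video models.
Applying the architecture to image classification (by removing the temporal dimension) improves on prior vision transformers.
The MHPA-based multiscale design makes modeling dense visual signals in space and time more efficient.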

This review was generated by AI and reviewed by a human editor.