QUICK REVIEW

[Paper Review] UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

Kunchang Li, Yali Wang|arXiv (Cornell University)|Jan 12, 2022

Human Pose and Action Recognition|Cited by 108

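One-Sentence Summary

UniFormer integrates 3D convolution and spatiotemporal self-attention in a single unified transformer that efficiently handles both the local redundancy and the global dependency in video representations, achieving strong accuracy with significantly fewer GFLOPs.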

ABSTRACT

It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-dimensional videos, due to large local redundancy and complex global dependency between video frames. The recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context to suppress local redundancy from a small 3D neighborhood, it lacks the capability to capture global dependency because of the limited receptive field. Alternatively, vision transformers can effectively capture long-range dependency by self-attention mechanism, while having the limitation on reducing local redundancy with blind similarity comparison among all the tokens in each layer. Based on these observations, we propose a novel Unified transFormer (UniFormer) which seamlessly integrates merits of 3D convolution and spatiotemporal self-attention in a concise transformer format, and achieves a preferable balance between computation and accuracy. Different from traditional transformers, our relation aggregator can tackle both spatiotemporal redundancy and dependency, by learning local and global token affinity respectively in shallow and deep layers. We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2. With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods. For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performances of 60.9% and 71.2% top-1 accuracy respectively. Code is available at https://github.com/Sense-X/UniFormer.

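Motivation and Objectives

Advance efficient spatiotemporal learning on high-dimensional video data by addressing both local redundancy and long-range dependency.
Propose a unified transformer (UniFormer) that fuses local, 3D-convolution-like operations with global self-attention in one concise transformer architecture.
Design a multi-head relation aggregator that handles local token relations in shallow layers and global token relations in deep layers.
Demonstrate state-of-the-art performance on Kinetics-400/600 and Something-Something V1/V2 while reducing GFLOPs.
Provide ablations and analysis of how unified attention, dynamic position embedding, and the stage-wise design affect performance.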

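Proposed Method

Introduces the UniFormer block, which consists of a Dynamic Position Embedding (DPE), a Multi-Head Relation Aggregator (MHRA), and a Feed-Forward Network (FFN).
In shallow layers, MHRA learns local relations through a local token affinity matrix that resembles a spatiotemporal convolution; in deep layers, it learns global relations through content-based similarity (Q/K), as in self-attention.
DPE extends conditional position encoding with a 3D depthwise convolution to preserve spatiotemporal order and handle variable clip lengths.
UniFormer blocks are stacked in a four-stage hierarchical network, with local MHRA in the early stages and global MHRA in the later stages, so that spatial and temporal context are modeled jointly.
Gives a convolution-style interpretation of local MHRA as a PWConv-DWConv-PWConv block and shows its efficiency gain over pure-attention designs.
Reports results on Kinetics-400/600 and Something-Something V1/V2 with ImageNet-1K pretraining, reaching high accuracy at significantly lower GFLOPs.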
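The difference between the two MHRA modes can be illustrated with a small sketch. It is not the authors' implementation: it uses a single head, skips the linear projections, runs on a tiny hypothetical token grid in pure Python, and the function names (`local_affinity`, `global_affinity`, `aggregate`) are made up for illustration. The local affinity depends only on the relative offset between two tokens and is zero outside a small 3D tube, which is what makes it behave like a convolution; the global affinity is a softmax over content similarity between all token pairs.

```python
import math
import random
from itertools import product

def global_affinity(q, k):
    """Content-based affinity: row-wise softmax of Q K^T / sqrt(d) over all tokens."""
    d = len(q[0])
    rows = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

def local_affinity(grid, tube, weight):
    """Affinity that depends only on the relative offset between two tokens and
    is zero outside a small t x h x w tube (a convolution-like pattern)."""
    T, H, W = grid
    coords = list(product(range(T), range(H), range(W)))
    index = {c: n for n, c in enumerate(coords)}
    rt, rh, rw = (s // 2 for s in tube)
    A = [[0.0] * len(coords) for _ in coords]
    for (t, h, w), i in index.items():
        offsets = product(range(-rt, rt + 1), range(-rh, rh + 1), range(-rw, rw + 1))
        for dt, dh, dw in offsets:
            j = index.get((t + dt, h + dh, w + dw))
            if j is not None:  # neighbours outside the grid are skipped
                A[i][j] = weight[(dt, dh, dw)]
    return A

def aggregate(A, v):
    """Relation aggregation: each output token is an affinity-weighted sum of values."""
    channels = range(len(v[0]))
    return [[sum(a * vj[c] for a, vj in zip(row, v)) for c in channels] for row in A]

random.seed(0)
grid, tube = (2, 3, 3), (3, 3, 3)  # hypothetical sizes, chosen to stay tiny
n_tokens = grid[0] * grid[1] * grid[2]
values = [[random.random() for _ in range(4)] for _ in range(n_tokens)]

# Shallow-layer style: stand-in "learnable" weights, one per relative offset.
weight = {off: random.random() for off in product(range(-1, 2), repeat=3)}
local_out = aggregate(local_affinity(grid, tube, weight), values)

# Deep-layer style: affinity computed from the token content itself.
global_out = aggregate(global_affinity(values, values), values)
print(len(local_out), len(global_out))
```

Because the local weights are shared across positions and indexed by offset, aggregating with this matrix gives the same result as sliding a small 3D kernel over the value tokens, whereas the global matrix has to be recomputed from the content of every clip.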

Experimental Results

Research Questions

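RQ1: Can a unified transformer architecture jointly optimize local spatiotemporal redundancy reduction and global dependency modeling for efficient video understanding?
RQ2: Does combining 3D-convolution-like local relations with global self-attention in a single MHRA module give a better computation-accuracy trade-off than existing video transformers?
RQ3: How do dynamic position embedding and block-level design choices (local vs. global MHRA in each stage) affect performance and transferability?
RQ4: How do pretraining, input tube size, and sampling strategy affect UniFormer's robustness and transfer learning?
RQ5: How does UniFormer compare with state-of-the-art methods on standard video benchmarks (Kinetics-400/600, Something-Something V1/V2)?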

Key Findings

Method	Pretraining	#Frames	GFLOPs	K400 Top-1	K400 Top-5	K600 Top-1	K600 Top-5
UniFormer-S	IN-1K	16 × 1 × 4	167	80.8	94.7	82.8	95.8
UniFormer-B	IN-1K	32 × 3 × 4	3108	83.0	95.4	84.9	96.7

Method	Pretraining	#Frames	GFLOPs	SSV1 Top-1	SSV1 Top-5	SSV2 Top-1	SSV2 Top-5
UniFormer-S	IN-1K	16 × 3 × 1	125	57.6	84.9	69.4	92.1
UniFormer-B	IN-1K	16 × 3 × 1	290	60.9	87.3	71.2	92.8
UniFormer-B	IN-1K	32 × 3 × 1	777	61.0	87.6	71.2	92.8
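The frame specifications in the table (for example 16 × 1 × 4) can be turned into a per-view cost with a little arithmetic. The sketch below assumes that the last two factors multiply to the number of test views and that the GFLOPs column is the total over all views; neither assumption is stated in this review, and `per_view_gflops` is a hypothetical helper name.

```python
def per_view_gflops(frames_spec, total_gflops):
    """Parse 'frames x a x b' and divide the total cost evenly over a * b views."""
    parts = frames_spec.replace("×", "x").split("x")
    frames, a, b = (int(part) for part in parts)
    views = a * b
    return frames, views, total_gflops / views

# Rows copied from the table above: (model, frame spec, GFLOPs).
rows = [
    ("UniFormer-S", "16 × 1 × 4", 167),
    ("UniFormer-S", "16 × 3 × 1", 125),
    ("UniFormer-B", "16 × 3 × 1", 290),
    ("UniFormer-B", "32 × 3 × 1", 777),
    ("UniFormer-B", "32 × 3 × 4", 3108),
]
for name, spec, gflops in rows:
    frames, views, each = per_view_gflops(spec, gflops)
    print(f"{name}: {frames} frames, {views} views, ~{each:.1f} GFLOPs per view")
```

Under this reading the two UniFormer-S rows both come out at roughly 41.7 GFLOPs per view and the two 32-frame UniFormer-B rows both at 259, which is at least consistent with the assumption that the reported totals scale with the number of views.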

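With only ImageNet-1K pretraining, UniFormer reaches 82.9% top-1 on Kinetics-400 and 84.8% on Kinetics-600 while requiring 10x fewer GFLOPs than other state-of-the-art methods.
Sets a new state of the art on Something-Something, with 60.9% top-1 on V1 and 71.2% top-1 on V2.
Shallow local MHRA effectively reduces local redundancy at low computational cost, while deep global MHRA captures long-range dependencies and is highly discriminative.
Joint spatiotemporal MHRA outperforms divided spatial/temporal attention and improves transfer learning performance.
Dynamic position embedding (DPE) improves accuracy by encoding spatiotemporal position information (up to about 1.7% top-1 on Kinetics-400).
Ablations show that using local MHRA in the early stages and global MHRA in the later stages gives a favorable balance and outperforms purely local or purely global configurations.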
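A rough count of token pairs helps explain why local MHRA is cheap in shallow stages and why global MHRA becomes affordable in deep ones. The token grid sizes and the 5 × 5 × 5 tube below are illustrative assumptions rather than figures taken from this review, and `token_pairs` is a hypothetical helper that ignores heads, channels, and border effects.

```python
def token_pairs(grid, tube=None):
    """Count token-to-token affinities evaluated in one layer.

    grid: (T, H, W) token grid.
    tube: local (t, h, w) neighbourhood, or None for global attention,
          which compares every token with every other token.
    """
    T, H, W = grid
    n = T * H * W
    if tube is None:
        return n * n
    t, h, w = tube
    return n * min(n, t * h * w)

# Assumed token grids: a high-resolution shallow stage and a low-resolution deep stage.
stages = {"shallow": (8, 56, 56), "deep": (8, 14, 14)}
tube = (5, 5, 5)  # assumed local neighbourhood

for name, grid in stages.items():
    global_pairs = token_pairs(grid)
    local_pairs = token_pairs(grid, tube)
    ratio = global_pairs / local_pairs
    print(f"{name}: global={global_pairs:,} local={local_pairs:,} ratio={ratio:.1f}x")
```

With these assumed sizes, global attention in the shallow stage would evaluate about 200 times as many token pairs as the local tube, while in the deep stage the gap shrinks to about 13 times, which fits the finding that a local-then-global arrangement balances cost and accuracy.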

This review was generated by AI and reviewed by human editors.