QUICK REVIEW

[Paper Review] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

Kunchang Li, Yali Wang | arXiv (Cornell University) | Nov 17, 2022

Visual Attention and Saliency Detection | Cited by 58

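ONE-SENTENCE SUMMARY

UniFormerV2 combines image-pretrained ViTs with the concise video designs of UniFormer to learn spatiotemporal representations, achieving state-of-the-art results on 8 video benchmarks and 90.0% top-1 accuracy on Kinetics-400.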

ABSTRACT

Learning discriminative spatiotemporal representation is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependency with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue, by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model requires a tiresome and complicated image-pretraining phase, before being finetuned on videos. This blocks its wide usage in practice. On the contrary, open-sourced ViTs are readily available and well-pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks, by arming the pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block. But it contains brand-new local and global relation aggregators, which allow for preferable accuracy-computation balance by seamlessly integrating advantages from both ViTs and UniFormer. Without any bells and whistles, our UniFormerV2 gets the state-of-the-art recognition performance on 8 popular video benchmarks, including scene-related Kinetics-400/600/700 and Moments in Time, temporal-related Something-Something V1/V2, untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400, to our best knowledge. Code will be available at https://github.com/OpenGVLab/UniFormerV2.

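MOTIVATION AND GOALS

Establish a practical paradigm for building strong video models by equipping open-source, image-pretrained ViTs with UniFormer-style video modules.
Design local and global relation aggregators that balance accuracy against computation.
Apply multi-stage fusion to integrate multi-scale spatiotemporal representations.
Validate the approach on diverse benchmarks (Kinetics-400/600/700, Moments in Time, Something-Something V1/V2, ActivityNet, HACS).
Demonstrate its effectiveness with a unified post-pretraining benchmark (Kinetics-710).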

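PROPOSED METHOD

Insert a local temporal MHRA in front of each ViT block to form a local UniBlock, which reduces temporal redundancy while still exploiting the pretrained spatial features.
Add a global UniBlock on top of each local block; it uses a query-based cross MHRA to summarize all tokens into a single video token, at a cost linear in the number of tokens.
Use a multi-stage fusion block to integrate the global tokens from several stages into the final video representation.
Reuse and adapt the MHRA from UniFormer, with a local temporal MHRA (LT_MHRA) and a global spatial MHRA (GS_MHRA), for efficient spatiotemporal modeling.
Overall pipeline: project the input into spatiotemporal tokens with a 3D convolution; downsample along the temporal dimension; apply the local and global UniBlocks; fuse the multi-stage outputs; optionally combine the result with the class token in a final fusion step.
Explore four fusion strategies (Sequential, Parallel, Hierarchical KV, Hierarchical Q) for combining global tokens across stages.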
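
The query-based summarization described above can be sketched in a few lines. This is a minimal, pure-Python illustration and not the authors' implementation: it assumes a single attention head, uses the tokens directly as keys and values (no learned projections), and assumes that the Sequential strategy hands each stage's video token on as the query for the next stage. The function names `cross_attend` and `sequential_fusion`, the random inputs, and all sizes are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, tokens):
    """Summarize `tokens` (N x d) into one d-dim vector using a single query.

    Simplification: keys and values are the tokens themselves; a real block
    would add learned projections and multiple heads. The cost is one dot
    product per token, so it grows linearly with N.
    """
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)
    scores = [scale * sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    weights = softmax(scores)
    summary = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)]
    return summary, weights


def sequential_fusion(stage_tokens, learned_query):
    """Assumed 'Sequential' fusion: each stage's video token queries the next stage."""
    video_token = learned_query
    for tokens in stage_tokens:
        video_token, _ = cross_attend(video_token, tokens)
    return video_token


random.seed(0)
dim, num_tokens, num_stages = 8, 4 * 14 * 14, 3  # hypothetical sizes
stages = [
    [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
    for _ in range(num_stages)
]
query = [random.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in for a learned query

token, weights = cross_attend(query, stages[0])
video_repr = sequential_fusion(stages, query)
print(len(token), len(weights), len(video_repr))
```

Because only one query attends to the N tokens, each global block performs N score computations rather than the N x N of full self-attention, which is where the linear cost comes from.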

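EXPERIMENTAL RESULTS

Research Questions

RQ1: Can image-pretrained ViTs be combined effectively with UniFormer-style video designs to improve spatiotemporal learning?
RQ2: How does the accuracy-efficiency trade-off compare with existing video models on standard benchmarks?
RQ3: How does multi-stage fusion of global tokens affect the final video representation?
RQ4: Does post-pretraining on the unified Kinetics-710 benchmark yield consistent gains on Kinetics-400/600/700, MiT, and other datasets?
RQ5: Is the proposed cross-MHRA global block computationally efficient while maintaining or improving performance?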
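
To give RQ5 some scale, the back-of-envelope count below compares the number of query-key comparisons in full self-attention with that of a single-query cross attention. It is an illustrative sketch only: the 224x224 input, the 16x16 patch size, the temporal stride of 2, and the helper names `token_grid` and `attention_pairs` are assumptions made for this example, and the count ignores projections, feed-forward layers, and the number of heads.

```python
def token_grid(frames, height, width, t_stride=2, patch=16):
    """Token grid (T', H', W') after a patch projection with temporal downsampling.

    All default sizes are assumptions chosen for illustration.
    """
    return frames // t_stride, height // patch, width // patch


def attention_pairs(num_tokens, num_queries=None):
    """Number of query-key comparisons in one attention layer, per head."""
    if num_queries is None:  # full self-attention: every token acts as a query
        num_queries = num_tokens
    return num_queries * num_tokens


for frames in (8, 16, 32):
    t, h, w = token_grid(frames, 224, 224)
    n = t * h * w
    full = attention_pairs(n)
    single = attention_pairs(n, num_queries=1)
    print(f"{frames:>2} frames: {n:>5} tokens, "
          f"self-attention pairs {full:>10,}, single-query pairs {single:>5,}")
```

Doubling the number of frames doubles the single-query count but quadruples the self-attention count, which is the scaling difference the question is concerned with.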

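Key Findings

Achieves state-of-the-art results on 8 popular video benchmarks: Kinetics-400/600/700, Moments in Time, Something-Something V1/V2, ActivityNet, and HACS.
Is the first model, to the authors' knowledge, to reach 90.0% top-1 accuracy on Kinetics-400.
Performs strongly across datasets, with favorable trade-offs between accuracy and both parameter count and FLOPs.
Post-pretraining on Kinetics-710 transfers well and leaves little additional fine-tuning to do (shown for K400/600/700).
Shows that equipping an image-pretrained ViT with UniFormer designs yields robust spatiotemporal representations for video tasks without extensive additional image pretraining.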

This review was generated by AI and reviewed by a human editor.