QUICK REVIEW

[Paper Review] UniFormer: Unifying Convolution and Self-attention for Visual Recognition

Kunchang Li, Yali Wang|arXiv (Cornell University)|Jan 24, 2022

Advanced Neural Network Applications | Cited by 24

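One-sentence summary

UniFormer unifies convolution and self-attention in a concise transformer block to address both local redundancy and global dependency, achieving strong accuracy and efficiency on image and video tasks. It introduces Dynamic Position Embedding and a multi-head relation aggregator that uses local token affinity in shallow layers and global token affinity in deep layers.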

ABSTRACT

It is a challenging task to learn discriminative representation from images and videos, due to large local redundancy and complex global dependency in these visual data. Convolution neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years. Though CNNs can efficiently decrease local redundancy by convolution within a small neighborhood, the limited receptive field makes it hard to capture global dependency. Alternatively, ViTs can effectively capture long-range dependency via self-attention, while blind similarity comparisons among all the tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which can seamlessly integrate the merits of convolution and self-attention in a concise transformer format. Different from the typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local and global token affinity respectively in shallow and deep layers, allowing to tackle both redundancy and dependency for efficient and effective representation learning. Finally, we flexibly stack our UniFormer blocks into a new powerful backbone, and adopt it for various vision tasks from image to video domain, from classification to dense prediction. Without any extra training data, our UniFormer achieves 86.3 top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it can simply achieve state-of-the-art performance in a broad range of downstream tasks, e.g., it obtains 82.9/84.8 top-1 accuracy on Kinetics-400/600, 60.9/71.2 top-1 accuracy on Sth-Sth V1/V2 video classification, 53.8 box AP and 46.4 mask AP on COCO object detection, 50.8 mIoU on ADE20K semantic segmentation, and 77.4 AP on COCO pose estimation. We further build an efficient UniFormer with 2-4x higher throughput. Code is available at https://github.com/Sense-X/UniFormer.

Motivation and objectives

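Explain why visual recognition needs to balance reducing local redundancy against capturing global dependency.
Propose a unified transformer block that combines convolution and self-attention in a single framework.
Design a lightweight, flexible backbone that performs well at low computational cost on tasks from image to video.
Demonstrate strong performance on classification, detection, segmentation, and pose estimation, using no extra training data and only standard ImageNet-1K pre-training.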

Proposed method

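Introduce Dynamic Position Embedding (DPE), which injects position information through a lightweight depthwise convolution.
Develop a Multi-Head Relation Aggregator (MHRA) that provides local token affinity in shallow layers and global token affinity in deep layers.
Express MHRA as R_n(X) = A_n V_n(X) and MHRA(X) = Concat(R_1, ..., R_N) U, so that convolution-style and self-attention-style token relation learning share one form.
Instantiate the local MHRA as a PWConv-DWConv-PWConv block, with a 5x5 depthwise convolution and a learnable, relative-position-style affinity matrix.
Instantiate the global MHRA as multi-head self-attention with Q/K-based token affinity, used for joint spatiotemporal relations (an image is treated as a single frame).
Assemble UniFormer blocks into a four-stage image backbone and extend it to 3D for video, with BN/LN and an FFN (GELU) for feature refinement.
Propose an efficient Hourglass UniFormer (H-UniFormer) variant that shrinks and then recovers tokens to raise throughput.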
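
The shared form R_n(X) = A_n V_n(X), MHRA(X) = Concat(R_1, ..., R_N) U can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: it works on a 1D token sequence rather than 2D/3D feature maps, the function names (`local_affinity`, `global_affinity`, `mhra`) and the window of 3 tokens are hypothetical, and the 1/sqrt(d) scaling in the global affinity is the usual self-attention convention assumed here. The only thing that changes between the local and global variants is how the affinity matrix A_n is built.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_affinity(num_tokens, rel_weights):
    # Affinity that depends only on the offset i - j inside a small
    # window and is zero outside it (convolution-like behaviour).
    radius = len(rel_weights) // 2
    a = np.zeros((num_tokens, num_tokens))
    for i in range(num_tokens):
        lo, hi = max(0, i - radius), min(num_tokens, i + radius + 1)
        for j in range(lo, hi):
            a[i, j] = rel_weights[i - j + radius]
    return a

def global_affinity(x, w_q, w_k):
    # Affinity from Q/K similarity over all tokens (self-attention-like).
    # The sqrt scaling is an assumption borrowed from standard attention.
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def mhra(x, affinities, w_v, w_u):
    # R_n(X) = A_n V_n(X); MHRA(X) = Concat(R_1, ..., R_N) U
    heads = [a @ (x @ wv) for a, wv in zip(affinities, w_v)]
    return np.concatenate(heads, axis=-1) @ w_u

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
num_tokens, channels, num_heads = 6, 8, 2
head_dim = channels // num_heads

x = rng.normal(size=(num_tokens, channels))
w_v = [rng.normal(size=(channels, head_dim)) for _ in range(num_heads)]
w_u = rng.normal(size=(channels, channels))

local_a = [local_affinity(num_tokens, rng.normal(size=3))
           for _ in range(num_heads)]
global_a = [global_affinity(x,
                            rng.normal(size=(channels, head_dim)),
                            rng.normal(size=(channels, head_dim)))
            for _ in range(num_heads)]

out_local = mhra(x, local_a, w_v, w_u)    # shallow-layer style
out_global = mhra(x, global_a, w_v, w_u)  # deep-layer style
```

Because the local affinity is banded and depends only on the relative offset, applying it is equivalent to a small depthwise convolution along the token axis, which is consistent with the PWConv-DWConv-PWConv instantiation listed above.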

Experimental results

Research questions

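RQ1: Can a single unified block combine convolution-like local affinity with global self-attention and improve both accuracy and efficiency on image and video tasks?
RQ2: Does Dynamic Position Embedding, combined with local-then-global relation aggregators, yield better representation learning than a pure CNN or a pure ViT?
RQ3: How does UniFormer compare with existing backbones on downstream tasks such as object detection, segmentation, and pose estimation?
RQ4: Can a lightweight UniFormer variant raise throughput substantially while maintaining performance?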

Key findings

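Reaches 86.3 top-1 accuracy on ImageNet-1K classification without extra training data.
With only ImageNet-1K pre-training, reaches 82.9/84.8 top-1 accuracy on Kinetics-400/600 and 60.9/71.2 on Something-Something V1/V2.
Reaches 53.8 box AP and 46.4 mask AP on COCO object detection and instance segmentation.
Reaches 50.8 mIoU on ADE20K semantic segmentation and 77.4 AP on COCO pose estimation.
The H-UniFormer variant delivers 2-4x higher throughput than recent lightweight models while maintaining performance.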

This review was generated by AI and checked by a human editor.