QUICK REVIEW

[Paper Review] UniFormer: Unifying Convolution and Self-attention for Visual Recognition

Kunchang Li, Yali Wang|arXiv (Cornell University)|2022. 01. 24.

Advanced Neural Network Applications | Citations: 24

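One-Line Summary

UniFormer integrates convolution and self-attention into a concise transformer block to address local redundancy and global dependency, achieving a strong accuracy-efficiency trade-off on image and video tasks. It also introduces dynamic position embedding and a multi-head relation aggregator that uses local token affinity in shallow layers and global token affinity in deep layers.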

ABSTRACT

It is a challenging task to learn discriminative representation from images and videos, due to large local redundancy and complex global dependency in these visual data. Convolution neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years. Though CNNs can efficiently decrease local redundancy by convolution within a small neighborhood, the limited receptive field makes it hard to capture global dependency. Alternatively, ViTs can effectively capture long-range dependency via self-attention, while blind similarity comparisons among all the tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which can seamlessly integrate the merits of convolution and self-attention in a concise transformer format. Different from the typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local and global token affinity respectively in shallow and deep layers, allowing to tackle both redundancy and dependency for efficient and effective representation learning. Finally, we flexibly stack our UniFormer blocks into a new powerful backbone, and adopt it for various vision tasks from image to video domain, from classification to dense prediction. Without any extra training data, our UniFormer achieves 86.3 top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it can simply achieve state-of-the-art performance in a broad range of downstream tasks, e.g., it obtains 82.9/84.8 top-1 accuracy on Kinetics-400/600, 60.9/71.2 top-1 accuracy on Sth-Sth V1/V2 video classification, 53.8 box AP and 46.4 mask AP on COCO object detection, 50.8 mIoU on ADE20K semantic segmentation, and 77.4 AP on COCO pose estimation. We further build an efficient UniFormer with 2-4x higher throughput. Code is available at https://github.com/Sense-X/UniFormer.

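Research Motivation and Goals

Motivated by the need to balance reducing local redundancy against capturing global dependency in visual recognition.
Proposes a unified transformer block that combines convolution and self-attention in a single framework.
Designs a lightweight, flexible backbone that is computationally efficient and works well from image to video tasks.
Demonstrates strong performance on classification, detection, segmentation, and pose estimation, either with no extra training data or with only ImageNet-1K pre-training.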

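Proposed Method

Introduces Dynamic Position Embedding (DPE), which injects position information through a lightweight depthwise convolution.
Develops a Multi-Head Relation Aggregator (MHRA) that uses local token affinity in shallow layers and global token affinity in deep layers.
Formalizes MHRA as R_n(X) = A_n V_n(X) and MHRA(X) = Concat(R_1(X), ..., R_N(X)) U, where A_n is the token affinity of head n, V_n is a linear transformation, and U is a learnable matrix that fuses the N heads; this single form covers both convolution-style and self-attention-style token relation learning.
Implements the local MHRA as a PWConv-DWConv-PWConv block, with a 5x5 depthwise convolution (DWConv) and a learnable relative-position affinity matrix.
Implements the global MHRA as multi-head self-attention with Q/K-based token affinity, modeling spatiotemporal relations jointly (an image is treated as a single frame).
Builds a four-stage backbone from UniFormer blocks for images and extends it to 3D for video, using BN/LN normalization and a GELU-activated FFN for feature refinement.
Proposes an efficient Hourglass UniFormer (H-UniFormer) variant that raises throughput through token shrinking and recovery.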
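
The MHRA formulation above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it assumes a 1D token sequence instead of the 2D/3D neighborhoods used for images and video, toy sizes, random weights, and hypothetical helper names (`matmul`, `local_affinity`, `global_affinity`, `mhra`). The only point is that swapping the affinity matrix A_n switches the same aggregator between convolution-like and attention-like behavior.

```python
import math
import random

def matmul(A, B):
    """Plain-Python matrix product of A (n x k) and B (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def local_affinity(num_tokens, offset_weights):
    """Local affinity: A[i][j] depends only on the offset i - j and is zero
    outside a small window, so aggregation behaves like a convolution."""
    radius = len(offset_weights) // 2
    A = [[0.0] * num_tokens for _ in range(num_tokens)]
    for i in range(num_tokens):
        for j in range(max(0, i - radius), min(num_tokens, i + radius + 1)):
            A[i][j] = offset_weights[i - j + radius]
    return A

def global_affinity(X, W_q, W_k):
    """Global affinity: row-wise softmax over Q K^T, as in self-attention.
    (No 1/sqrt(d) scaling here; that is a simplification of this sketch.)"""
    Q, K = matmul(X, W_q), matmul(X, W_k)
    A = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        A.append([e / total for e in exps])
    return A

def mhra(X, affinities, value_weights, U):
    """MHRA(X) = Concat(R_1(X), ..., R_N(X)) U with R_n(X) = A_n V_n(X)."""
    heads = [matmul(A_n, matmul(X, W_v))
             for A_n, W_v in zip(affinities, value_weights)]
    # Concatenate the per-head outputs along the channel axis, token by token.
    concat = [sum((head[i] for head in heads), []) for i in range(len(X))]
    return matmul(concat, U)

def rand(rows, cols):
    """Random matrix standing in for learned weights (assumption)."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
L, C, N = 6, 4, 2  # tokens, channels, heads (toy sizes)
d = C // N         # channels per head

X = rand(L, C)
W_v = [rand(C, d) for _ in range(N)]
U = rand(C, C)

# Same aggregator, two different affinities.
A_local = [local_affinity(L, [random.uniform(-1, 1) for _ in range(3)])
           for _ in range(N)]
A_global = [global_affinity(X, rand(C, d), rand(C, d)) for _ in range(N)]

out_local = mhra(X, A_local, W_v, U)
out_global = mhra(X, A_global, W_v, U)
print(len(out_local), len(out_local[0]))  # 6 4
```

In this sketch the local affinity is data-independent and sparse (only a window of offsets is non-zero), while the global affinity is computed from the tokens themselves and is dense, which mirrors the shallow-versus-deep split described above.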

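Experimental Results

Research Questions

RQ1: Can a unified block that combines convolution-like local affinity with global self-attention improve accuracy and efficiency across image and video tasks?
RQ2: Does combining local and global token affinity with dynamic position embedding yield better representation learning than a pure CNN or ViT?
RQ3: How does UniFormer perform against existing backbones on downstream tasks such as object detection, segmentation, and pose estimation?
RQ4: Can a lightweight UniFormer variant substantially increase throughput while maintaining performance?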

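Key Results

Achieves 86.3 top-1 accuracy on ImageNet-1K without any extra training data.
With ImageNet-1K pre-training, achieves 82.9/84.8 top-1 accuracy on Kinetics-400/600 and 60.9/71.2 on Something-Something V1/V2.
Achieves 53.8 box AP and 46.4 mask AP on COCO object detection and instance segmentation.
Achieves 50.8 mIoU on ADE20K semantic segmentation and 77.4 AP on COCO pose estimation.
The Hourglass UniFormer variant delivers 2-4x higher throughput than recent lightweight models while maintaining performance.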


This review was generated by AI and reviewed by a human editor.