QUICK REVIEW

[Paper Review] UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

Xiaohan Ding, Yiyuan Zhang | arXiv (Cornell University) | November 27, 2023

Advanced Neural Network Applications | Citations: 34

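One-line summary

UniRepLKNet applies four design guidelines and a Dilated Reparam Block to large-kernel ConvNets, reaching state-of-the-art performance with high efficiency across modalities including images, time series, and audio.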

ABSTRACT

Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but two unresolved and critical issues demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers, while the architectural design for large-kernel ConvNets remains under-addressed. 2) As transformers have dominated multiple modalities, it remains to be investigated whether ConvNets also have a strong universal perception ability in domains beyond vision. In this paper, we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep. Following such guidelines, our proposed large-kernel ConvNet shows leading performance in image recognition (ImageNet accuracy of 88.0%, ADE20K mIoU of 55.6%, and COCO box AP of 56.4%), demonstrating better performance and higher speed than the recent powerful competitors. 2) We discover large kernels are the key to unlocking the exceptional performance of ConvNets in domains where they were originally not proficient. With certain modality-related preprocessing approaches, the proposed model achieves state-of-the-art performance on time-series forecasting and audio recognition tasks even without modality-specific customization to the architecture. All the code and models are publicly available on GitHub and Huggingface.

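Research motivation and goals

Motivate the work by the gap in architectural design for large-kernel ConvNets, and evaluate their universal perception ability across modalities.
Propose four design guidelines that decouple growth of the effective receptive field (ERF) from depth and improve efficiency.
Show that, with modality-specific preprocessing, large-kernel ConvNets can excel across image, audio, video, time-series, and point cloud data.
Establish universality with empirical results on ImageNet, ADE20K, COCO, and time-series/audio benchmarks.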
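
The idea that large kernels "see wide without going deep" can be illustrated with simple arithmetic. The sketch below computes the theoretical receptive field of stacked stride-1 convolutions along one axis; note that this is only a rough proxy for the ERF the paper discusses, and the function name `receptive_field` is a hypothetical helper, not something from the paper's code.

```python
def receptive_field(kernel_sizes, dilations=None):
    """Theoretical receptive field (along one axis) of stacked stride-1 convs.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf


# Six stacked 3x3 layers span the same 13 pixels as a single 13x13 layer.
small_stack = receptive_field([3] * 6)
one_large = receptive_field([13])
assert small_stack == one_large == 13
```

A small-kernel network therefore needs depth to widen its view, whereas a single large-kernel layer reaches the same span at once, which is what lets depth be chosen for other reasons.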

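Proposed method

Propose four architectural guidelines for large-kernel ConvNets: (1) use efficient inter-channel structures to increase depth; (2) employ a Dilated Reparam Block that re-parameterizes a large kernel via parallel dilated small-kernel branches; (3) place large kernels in the middle and high layers and tailor kernel sizes to the downstream task; (4) increase depth with small kernels rather than with more large kernels.
Introduce the Dilated Reparam Block, which uses parallel dilated small-kernel branches whose outputs are summed; at inference, the BN layers are merged and the branches are re-parameterized into a single large kernel.
Adopt a plain backbone with four stages and downsampling blocks, using large kernels (K=13) in the middle and high stages and SE blocks to increase depth efficiently.
Generalize UniRepLKNet to non-image modalities (time series, audio, point cloud, video) by transforming the data into embedding maps of shape B x C' x H x W and applying the same backbone with minimal modality-specific preprocessing.
Provide a family of model instances (A, F, P, N, T, S, B, L, XL) of varying depth and width, and report their throughput and accuracy.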
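
The re-parameterization step can be sketched in a few lines. The example below is a minimal single-channel illustration in pure Python, not the authors' implementation: it assumes BN has already been folded into each branch, and the kernel size (7), the branch dilations (1, 2, 3), and all function names are hypothetical choices for the demo. It relies on the fact that a k x k kernel with dilation r equals a dense ((k-1)r+1) x ((k-1)r+1) kernel with zeros between the taps, so every branch can be zero-padded and added into the large kernel.

```python
import random


def dilate_kernel(kernel, r):
    """Insert r-1 zeros between taps: k x k with dilation r -> dense kernel."""
    k = len(kernel)
    size = (k - 1) * r + 1
    dense = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            dense[i * r][j * r] = kernel[i][j]
    return dense


def pad_to(kernel, K):
    """Zero-pad a square odd-sized kernel symmetrically to K x K."""
    size = len(kernel)
    off = (K - size) // 2
    out = [[0.0] * K for _ in range(K)]
    for i in range(size):
        for j in range(size):
            out[i + off][j + off] = kernel[i][j]
    return out


def merge_branches(large, branches):
    """Fold parallel (kernel, dilation) branches into one large kernel."""
    K = len(large)
    merged = [row[:] for row in large]
    for kernel, r in branches:
        dense = pad_to(dilate_kernel(kernel, r), K)
        for i in range(K):
            for j in range(K):
                merged[i][j] += dense[i][j]
    return merged


def conv2d_same(x, kernel, r=1):
    """Single-channel cross-correlation, zero 'same' padding, dilation r."""
    H, W, k = len(x), len(x[0]), len(kernel)
    c = (k - 1) // 2
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for xc in range(W):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + (i - c) * r, xc + (j - c) * r
                    if 0 <= yy < H and 0 <= xx < W:
                        acc += kernel[i][j] * x[yy][xx]
            out[y][xc] = acc
    return out


# Integer-valued weights keep the float arithmetic exact for the comparison.
rng = random.Random(0)


def rand_kernel(k):
    return [[float(rng.randint(-3, 3)) for _ in range(k)] for _ in range(k)]


large = rand_kernel(7)
branches = [(rand_kernel(3), 1), (rand_kernel(3), 2), (rand_kernel(3), 3)]
x = [[float(rng.randint(-5, 5)) for _ in range(9)] for _ in range(9)]

# Training-time view: large kernel plus dilated branches, outputs summed.
parallel = conv2d_same(x, large)
for kernel, r in branches:
    y = conv2d_same(x, kernel, r)
    parallel = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(parallel, y)]

# Inference-time view: one merged large kernel gives the same output.
single = conv2d_same(x, merge_branches(large, branches))
assert parallel == single
```

Because the merge is exact, the extra branches add no inference cost: they only shape what the single large kernel learns during training.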

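Experimental results

Research questions

RQ1: Can large-kernel ConvNets achieve state-of-the-art performance on standard vision tasks while maintaining high throughput?
RQ2: With minimal modality-specific customization, do large-kernel ConvNets show universal perception ability across audio, video, point cloud, time-series, and image data?
RQ3: Which design choices optimize performance and efficiency on downstream benchmarks such as ImageNet, ADE20K, and COCO?
RQ4: Is there evidence that scaling up kernels preserves or improves feature quality when combined with an appropriate downstream framework (e.g., UPerNet for segmentation)?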

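Key results

Across variants, UniRepLKNet reaches ImageNet top-1 accuracy of 83.9–87.9% at throughput comparable to or higher than competing models.
On ImageNet, UniRepLKNet-A/F are more accurate and faster than ConvNeXt V2-A/F; UniRepLKNet-P/N outperform FastViT-T12/S12 and ConvNeXt V2-P/N.
In object detection and segmentation, UniRepLKNet variants score highly on COCO box AP and ADE20K mIoU, outperforming ViTs and large-kernel baselines.
Extending depth with small kernels improves the speed-accuracy trade-off (LarK vs. SmaK blocks). Nine LarK blocks in stage 3 balance accuracy and throughput.
UniRepLKNet demonstrates universal perception ability by applying the same backbone to time-series forecasting and audio recognition through modality-specific embedding maps, and obtains state-of-the-art results on GFS temperature and wind-speed forecasting.
Across modalities, UniRepLKNet matches or outperforms specialized architectures while maintaining high GPU throughput.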
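
To make the "embedding map" idea concrete, the sketch below folds a sequence of feature vectors into a C' x H x W grid for one sample, which is the shape an image backbone expects. This is an assumption-laden illustration only: the row-major folding and the helper name `fold_to_map` are hypothetical, and the paper's actual per-modality preprocessing may differ.

```python
def fold_to_map(tokens, height, width):
    """Fold T = height * width token vectors (length C each) into C x H x W."""
    if len(tokens) != height * width:
        raise ValueError("need exactly height * width tokens")
    channels = len(tokens[0])
    return [
        [[tokens[h * width + w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]


# Six time steps with two features each become a 2 x 2 x 3 map (C' x H x W).
series = [[t, 10 * t] for t in range(6)]
fmap = fold_to_map(series, height=2, width=3)
assert fmap[0] == [[0, 1, 2], [3, 4, 5]]
assert fmap[1] == [[0, 10, 20], [30, 40, 50]]
```

Stacking such maps over a batch gives the B x C' x H x W tensor, after which the backbone needs no modality-specific changes.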


This review was generated by AI and reviewed by a human editor.