QUICK REVIEW

[Paper Review] UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

Xiaohan Ding, Yiyuan Zhang|arXiv (Cornell University)|Nov 27, 2023

Advanced Neural Network Applications | Cited by 34

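ONE-SENTENCE SUMMARY

UniRepLKNet is a large-kernel ConvNet built on four architectural guidelines and a Dilated Reparam Block; it achieves state-of-the-art results on images and on other modalities (including time series and audio) while remaining efficient.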

ABSTRACT

Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but two unresolved and critical issues demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers, while the architectural design for large-kernel ConvNets remains under-addressed. 2) As transformers have dominated multiple modalities, it remains to be investigated whether ConvNets also have a strong universal perception ability in domains beyond vision. In this paper, we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep. Following such guidelines, our proposed large-kernel ConvNet shows leading performance in image recognition (ImageNet accuracy of 88.0%, ADE20K mIoU of 55.6%, and COCO box AP of 56.4%), demonstrating better performance and higher speed than the recent powerful competitors. 2) We discover large kernels are the key to unlocking the exceptional performance of ConvNets in domains where they were originally not proficient. With certain modality-related preprocessing approaches, the proposed model achieves state-of-the-art performance on time-series forecasting and audio recognition tasks even without modality-specific customization to the architecture. All the code and models are publicly available on GitHub and Huggingface.

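RESEARCH MOTIVATION AND OBJECTIVES

Address the gap in architectural design for large-kernel ConvNets and evaluate their universal perception ability across modalities.
Propose four architectural guidelines that decouple the growth of the effective receptive field (ERF) from depth and improve efficiency.
Show that, given modality-specific preprocessing, large-kernel ConvNets can perform well on images, audio, video, time series and point clouds.
Report empirical results on ImageNet, ADE20K, COCO and time-series/audio benchmarks to establish this generality.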
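
The goal of decoupling ERF growth from depth can be made concrete with simple receptive-field arithmetic. The sketch below computes the theoretical receptive field along one axis for stacked stride-1 convolutions. This is only an upper bound on the effective receptive field the paper is concerned with, and the function names are illustrative rather than taken from the paper.

```python
def stacked_receptive_field(kernel_size: int, num_layers: int, dilation: int = 1) -> int:
    """Theoretical receptive field (one axis) of stride-1 conv layers applied in sequence."""
    # A dilated kernel covers dilation * (kernel_size - 1) + 1 input positions.
    effective_kernel = dilation * (kernel_size - 1) + 1
    # Each stride-1 layer widens the field by (effective_kernel - 1) positions.
    return 1 + num_layers * (effective_kernel - 1)


def layers_needed(target_field: int, kernel_size: int) -> int:
    """Smallest number of stacked stride-1 layers whose field reaches target_field."""
    layers = 0
    while stacked_receptive_field(kernel_size, layers) < target_field:
        layers += 1
    return layers


if __name__ == "__main__":
    print(stacked_receptive_field(13, 1))  # 13: one layer with a 13-wide kernel
    print(layers_needed(13, 3))            # 6: 3-wide kernels need six layers
```

Under these assumptions a single 13x13 layer already spans 13 positions per axis, while 3x3 layers need a stack of six to match it, which is the sense in which large kernels "see wide without going deep".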

PROPOSED METHOD

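Use efficient cross-channel structures, such as SE blocks, to increase depth;
Use a Dilated Reparam Block, which re-parameterizes a large kernel through parallel dilated small-kernel branches;
Place large kernels in the middle and high-level layers and choose the kernel size according to the downstream task;
Add depth with small kernels rather than with more large kernels.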
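Introduce the Dilated Reparam Block, which sums the outputs of a large-kernel convolution and parallel dilated small-kernel branches; at inference time the BN layers are folded and the branches are re-parameterized into a single large kernel.
Adopt a plain backbone with four stages and downsampling blocks, using large kernels (K=13) in the middle and high stages and SE blocks to add depth efficiently.
Generalize UniRepLKNet to non-image modalities (time series, audio, point cloud, video) by converting the data into embedding maps of shape B x C' x H x W and applying the same backbone, with only minimal modality-specific preprocessing.
Provide a family of model instances (A, F, P, N, T, S, B, L, XL) with different depths and widths, and report their throughput and accuracy.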
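
The re-parameterization step rests on the linearity of convolution: a dilated small kernel equals a dense kernel with zeros inserted between its taps, and that dense kernel can be added into the center of the large kernel. The following is a minimal 1D sketch of this idea in plain Python, not the authors' implementation. It assumes the BN layers have already been folded, and all function names and numbers are made up for illustration.

```python
def conv1d_same(signal, kernel, dilation=1):
    """Cross-correlation with zero padding; output length equals input length (odd kernels)."""
    half = (len(kernel) // 2) * dilation
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i - half + j * dilation
            if 0 <= idx < len(signal):  # positions outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def dilated_to_dense(kernel, dilation):
    """Rewrite a dilated kernel as a dense one by inserting (dilation - 1) zeros between taps."""
    dense = [0.0] * (dilation * (len(kernel) - 1) + 1)
    for j, w in enumerate(kernel):
        dense[j * dilation] = w
    return dense


def merge_into_large(large, small_dense):
    """Add the dense small kernel into the center of the large kernel."""
    pad = (len(large) - len(small_dense)) // 2
    merged = list(large)
    for j, w in enumerate(small_dense):
        merged[pad + j] += w
    return merged


# Illustrative values: a 7-tap "large" kernel and a 3-tap branch with dilation 2.
signal = [1.0, -2.0, 3.0, 0.5, 4.0, -1.0, 2.0, 0.0, 1.5, -3.0]
large = [0.5, -1.0, 2.0, 1.0, 0.0, -0.5, 1.5]
small, dilation = [1.0, -2.0, 0.5], 2

# Training-time form: two parallel branches whose outputs are summed.
two_branch = [a + b for a, b in zip(conv1d_same(signal, large),
                                    conv1d_same(signal, small, dilation))]
# Inference-time form: one convolution with the merged kernel.
merged = merge_into_large(large, dilated_to_dense(small, dilation))
single = conv1d_same(signal, merged)

max_gap = max(abs(a - b) for a, b in zip(two_branch, single))
assert max_gap < 1e-9
```

The same argument carries over to 2D kernels, where the zeros are inserted along both spatial axes before the centered addition.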

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

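RQ1: Can large-kernel ConvNets reach state-of-the-art performance on standard vision tasks while maintaining high throughput?
RQ2: With only minimal modality-specific customization, do large-kernel ConvNets have a universal perception ability across audio, video, point cloud, time-series and image data?
RQ3: Which architectural choices best optimize the performance and efficiency of large-kernel ConvNets on ImageNet and on downstream tasks such as ADE20K and COCO?
RQ4: Is there evidence that enlarging the kernel still preserves or improves feature quality when paired with an appropriate downstream framework?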

KEY FINDINGS

Method	Type	Input	Params (M)	FLOPs (G)	Throughput (img/s)	Accuracy (%)	Notes
UniRepLKNet-A	C	224^2	4.4	0.6	5942	77.0	ImageNet-1K
UniRepLKNet-F	C	224^2	6.2	0.9	5173	78.6	ImageNet-1K
UniRepLKNet-P	C	224^2	10.7	1.6	3949	80.2	ImageNet-1K
UniRepLKNet-N	C	224^2	18.3	2.8	2807	81.6	ImageNet-1K
UniRepLKNet-T	C	224^2	31	4.9	1804	83.2	ImageNet-1K
UniRepLKNet-S	C	224^2	56	9.1	1265	83.9	ImageNet-1K
UniRepLKNet-B	C	224^2	98	/	/	/
UniRepLKNet-L	C	224^2	218	/	/	/
UniRepLKNet-XL	C	384^2	386	/	/	87.4	Largest variant (ImageNet via 384^2)

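Across its variants, UniRepLKNet reaches ImageNet top-1 accuracies of up to 83.9–87.9%, with throughput that is competitive with or better than that of the compared models.
On ImageNet, UniRepLKNet-A/F are more accurate and faster than ConvNeXt V2-A/F; UniRepLKNet-P/N outperform FastViT-T12/S12 and ConvNeXt V2-P/N.
For object detection and segmentation, UniRepLKNet variants outperform several ViT and large-kernel baselines in COCO box AP and mask AP and in ADE20K mIoU.
Scaling depth with small-kernel (SmaK) blocks rather than additional large-kernel (LarK) blocks gives a better speed-accuracy trade-off; using 9 LarK blocks in Stage 3 balances accuracy and throughput.
UniRepLKNet demonstrates universal perception ability by applying the same backbone, with modality-specific embedding maps, to time-series forecasting and audio recognition, reaching state-of-the-art results on GFS temperature and wind speed forecasting.
Across modalities, UniRepLKNet performs better than or close to specialized architectures while maintaining high throughput on GPU.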
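
To illustrate what a modality-specific embedding map might look like, the sketch below folds a sequence of per-step feature vectors into a C' x H x W grid that an image backbone could consume. This is only one plausible arrangement, assumed here for illustration: the row-major layout, the zero padding and the name `fold_to_map` are hypothetical and do not reproduce the paper's preprocessing.

```python
def fold_to_map(tokens, height, width):
    """Fold T feature vectors (each of length C') into a nested list shaped [C'][H][W].

    Tokens fill the grid row by row; unused trailing cells are left as zeros.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    if len(tokens) > height * width:
        raise ValueError("grid is too small for the sequence")
    channels = len(tokens[0])
    grid = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for t, vector in enumerate(tokens):
        row, col = divmod(t, width)  # row-major placement of time step t
        for c in range(channels):
            grid[c][row][col] = vector[c]
    return grid


def map_shape(batch):
    """Return (B, C', H, W) for a batch given as a list of folded maps."""
    return (len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))


# Five time steps with two channels each, folded into a 2 x 3 grid.
tokens = [[float(t), float(-t)] for t in range(5)]
batch = [fold_to_map(tokens, height=2, width=3)]
print(map_shape(batch))  # (1, 2, 2, 3)
```

Once the data is in this shape, the same backbone can be applied without changes to its architecture, which is the point the findings above rely on.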


This review was generated by AI and checked by a human editor.