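[Paper Explained] InceptionNeXt: When Inception Meets ConvNeXt
InceptionNeXt decomposes large-kernel depthwise convolution into four parallel branches (including an identity branch) to improve speed while maintaining or improving accuracy. Compared with ConvNeXt, it trains and runs inference faster, and it achieves strong results on ImageNet and ADE20K.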
Inspired by the long-range modeling ability of ViTs, large-kernel convolutions have recently been widely studied and adopted to enlarge the receptive field and improve model performance, as in the remarkable work ConvNeXt, which employs 7x7 depthwise convolution. Although such a depthwise operator consumes only a few FLOPs, it largely harms model efficiency on powerful computing devices due to its high memory access costs. For example, ConvNeXt-T has similar FLOPs to ResNet-50 but achieves only ~60% of its throughput when trained on A100 GPUs with full precision. Although reducing the kernel size of ConvNeXt can improve speed, it results in significant performance degradation, which poses a challenging problem: how to speed up large-kernel-based CNN models while preserving their performance. To tackle this issue, inspired by Inceptions, we propose to decompose large-kernel depthwise convolution into four parallel branches along the channel dimension, i.e., a small square kernel, two orthogonal band kernels, and an identity mapping. With this new Inception depthwise convolution, we build a series of networks, namely InceptionNeXt, which not only enjoy high throughput but also maintain competitive performance. For instance, InceptionNeXt-T achieves 1.6x higher training throughput than ConvNeXt-T and attains a 0.2% top-1 accuracy improvement on ImageNet-1K. We anticipate InceptionNeXt can serve as an economical baseline for future architecture design to reduce carbon footprint. Code is available at https://github.com/sail-sg/inceptionnext.
Research Motivation and Objectives
- Motivate faster large-kernel CNNs that retain high accuracy in vision models.
- Introduce an efficient depthwise convolution operator inspired by Inception that reduces memory access costs.
- Develop the InceptionNeXt family of models as economical baselines for CNN design.
Proposed Method
- Introduce Inception depthwise convolution that splits channels and processes them with four parallel branches: small 3x3 square, horizontal band 1xk, vertical band kx1, and identity.
- Formally decompose large-kernel depthwise conv into four branches and concatenate outputs to form the feature map.
- Embed the Inception depthwise module into a MetaNeXt/ConvNeXt-like block to create the InceptionNeXt backbone.
- Configure four-stage architectures with channel-dimension scaling and an MLP ratio tuned for performance and speed.
- Provide ablations to study branch importance, band kernel sizes, and convolution branch ratio.
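The channel split and four branches described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the defaults `band_size=11` and `branch_ratio=0.125` are assumed values that this summary does not state, and a fixed averaging kernel stands in for learned per-channel weights.

```python
def depthwise_conv2d(channel, kernel):
    """Apply one kh x kw kernel to one H x W channel.

    Uses zero padding so the output keeps the H x W size, and computes
    cross-correlation (no kernel flip), as deep learning frameworks do.
    """
    h, w = len(channel), len(channel[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    y, x = i + di - ph, j + dj - pw
                    if 0 <= y < h and 0 <= x < w:
                        acc += channel[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def box_kernel(kh, kw):
    """Placeholder averaging kernel; a trained model would learn these weights."""
    return [[1.0 / (kh * kw)] * kw for _ in range(kh)]


def inception_dwconv(x, band_size=11, branch_ratio=0.125, square_size=3):
    """x is a list of C channels, each an H x W nested list.

    Splits the channels into identity / square / horizontal band /
    vertical band groups and concatenates the branch outputs.
    The default sizes and ratio are assumptions for illustration.
    """
    c = len(x)
    gc = int(c * branch_ratio)   # channels given to each convolution branch
    n_id = c - 3 * gc            # remaining channels pass through unchanged

    x_id = x[:n_id]
    x_sq = x[n_id:n_id + gc]
    x_w = x[n_id + gc:n_id + 2 * gc]
    x_h = x[n_id + 2 * gc:]

    out = [[row[:] for row in ch] for ch in x_id]                     # identity
    out += [depthwise_conv2d(ch, box_kernel(square_size, square_size))
            for ch in x_sq]                                           # 3x3
    out += [depthwise_conv2d(ch, box_kernel(1, band_size))
            for ch in x_w]                                            # 1xk
    out += [depthwise_conv2d(ch, box_kernel(band_size, 1))
            for ch in x_h]                                            # kx1
    return out


if __name__ == "__main__":
    # Eight 4x4 channels of ones: 5 identity channels and 1 per conv branch.
    x = [[[1.0] * 4 for _ in range(4)] for _ in range(8)]
    y = inception_dwconv(x)
    print(len(y), len(y[0]), len(y[0][0]))
```

Because most channels take the identity path and the band kernels touch only k positions instead of k x k, each output pixel needs far fewer memory reads than a full large-kernel depthwise convolution, which is the efficiency argument the paper makes.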
Experimental Results
Research Questions
- RQ1: Can large-kernel depthwise convolutions be made efficient without sacrificing accuracy in CNNs?
- RQ2: Does an Inception-style decomposition of depthwise convolutions offer better speed-accuracy trade-offs than standard ConvNeXt-like blocks?
- RQ3: Are InceptionNeXt backbones competitive with state-of-the-art ViT/CNN hybrids on ImageNet-1K and ADE20K?
Key Findings
| Model | Params (M) | MACs (G) | Train Throughput (imgs/s) | Inference Throughput (imgs/s) | Top-1 (%) | Notes |
|---|---|---|---|---|---|---|
| InceptionNeXt-T (Ours) | 28 | 4.2 | 901 | 2900 | 82.3 | Baseline for the ablation study; +0.2 over ConvNeXt-T per Table 4. |
| InceptionNeXt-S (Ours) | 49 | 8.4 | 521 | 1750 | 83.5 | Higher throughput and accuracy vs ConvNeXt-S. |
| InceptionNeXt-B (Ours) | 87 | 14.9 | 375 | 1244 | 84.0 | Best trade-off among tested sizes; +0.2 over ConvNeXt-B. |
- InceptionNeXt-T achieves 0.2% higher top-1 accuracy than ConvNeXt-T while delivering 1.6x training throughput and 1.2x inference throughput on A100 GPUs.
- Across sizes, InceptionNeXt matches or improves on ConvNeXt in accuracy, with notable training speedups and competitive or better inference throughput.
- On ImageNet-1K, InceptionNeXt-S and InceptionNeXt-B deliver higher top-1 accuracy than their ConvNeXt counterparts, with gains of up to ~0.4% and substantial throughput advantages (e.g., InceptionNeXt-B: 84.0% top-1 vs. 83.8% for ConvNeXt-B).
- Ablation results show removing either horizontal/vertical band branches or the small 3x3 branch reduces accuracy, while the parallel band branches provide a speed/accuracy balance.
- In semantic segmentation on ADE20K, InceptionNeXt backbones outperform Swin and ConvNeXt across model sizes with higher mIoU (e.g., InceptionNeXt-B: 46.4 mIoU with Semantic FPN).
This explainer was generated by AI and reviewed by a human editor.