QUICK REVIEW

[Paper Review] RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition

Xiaohan Ding, Chunlong Xia|arXiv (Cornell University)|May 5, 2021

Advanced Neural Network Applications | References: 31 | Cited by: 65

ONE-SENTENCE SUMMARY

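RepMLP introduces a training-time block (Global Perceptron, Partition Perceptron, and Local Perceptron) that is merged into three fully-connected layers for inference, reaching accuracy competitive with traditional CNNs on ImageNet and related tasks at lower FLOPs and higher speed.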

ABSTRACT

We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers. Compared to convolutional layers, FC layers are more efficient, better at modeling the long-range dependencies and positional patterns, but worse at capturing the local structures, hence usually less favored for image recognition. We propose a structural re-parameterization technique that adds local prior into an FC to make it powerful for image recognition. Specifically, we construct convolutional layers inside a RepMLP during training and merge them into the FC for inference. On CIFAR, a simple pure-MLP model shows performance very close to CNN. By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs. Our intriguing findings highlight that combining the global representational capacity and positional perception of FC with the local prior of convolution can improve the performance of neural network with faster speed on both the tasks with translation invariance (e.g., semantic segmentation) and those with aligned images and positional patterns (e.g., face recognition). The code and models are available at https://github.com/DingXiaoH/RepMLP.

MOTIVATION AND OBJECTIVES

Motivate combining the global capacity and positional perception of fully-connected layers with local priors from convolutions for image recognition.
Develop a training-time RepMLP block (Global Perceptron, Partition Perceptron, Local Perceptron) and a simple, platform-agnostic method to merge convs into FCs for inference.
Demonstrate performance gains over traditional CNNs on tasks including ImageNet classification, face recognition, and semantic segmentation, with lower FLOPs.
Provide practical guidelines for deploying RepMLP within ResNet-style architectures and show the effects of design choices (partitioning, grouping, kernel sizes).

PROPOSED METHOD

Introduce Global Perceptron to inject global correlations across partitions of the feature map.
Introduce Partition Perceptron with an FC and BN that operates on partitioned maps to share parameters across partitions.
Introduce Local Perceptron with multiple conv branches (K = 1, 3, 5, 7) and BN, whose outputs are summed with the partition output.
Propose groupwise FC (gFC), which reduces the parameter count and can be implemented as a grouped 1x1 conv, enabling scalable modeling of long-range dependencies.
Describe a simple, differentiable procedure to merge conv and BN into a single FC-based inference block (W^(F,p), BN fusion equations) that preserves equivalence with training-time computations.
Explain conversion of the entire RepMLP block into three FC layers for efficient inference.
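
The conv-into-FC merge listed above can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the repository linked in the abstract); it only demonstrates the underlying idea under simplifying assumptions: a single dense (ungrouped) stride-1 conv with zero padding, BN evaluated with fixed running statistics, and tiny hypothetical sizes. All function names here (`conv2d`, `fuse_conv_bn`, `conv_to_fc`, `fc`) are made up for illustration. Because a conv is linear, probing it with one-hot inputs recovers the columns of an equivalent FC weight matrix.

```python
import random

def conv2d(x, kernel, pad):
    """Stride-1, zero-padded cross-correlation. x: [C][H][W], kernel: [O][C][K][K]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K = len(kernel[0][0])
    out = []
    for k_o in kernel:
        plane = [[0.0] * W for _ in range(H)]
        for i in range(H):
            for j in range(W):
                acc = 0.0
                for c in range(C):
                    for u in range(K):
                        for v in range(K):
                            ii, jj = i + u - pad, j + v - pad
                            if 0 <= ii < H and 0 <= jj < W:
                                acc += k_o[c][u][v] * x[c][ii][jj]
                plane[i][j] = acc
        out.append(plane)
    return out

def flatten(t):
    """[C][H][W] -> flat list in channel-major, then row-major order."""
    return [v for plane in t for row in plane for v in row]

def unflatten(vec, C, H, W):
    """Inverse of flatten."""
    return [[vec[(c * H + i) * W:(c * H + i) * W + W] for i in range(H)]
            for c in range(C)]

def fuse_conv_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold BN (with fixed statistics) into the preceding conv: (kernel', bias')."""
    scale = [g / (v + eps) ** 0.5 for g, v in zip(gamma, var)]
    fused = [[[[w * s for w in row] for row in ch] for ch in k_o]
             for k_o, s in zip(kernel, scale)]
    bias = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return fused, bias

def conv_to_fc(kernel, pad, C, H, W):
    """Equivalent (O*H*W) x (C*H*W) FC matrix, built by probing with one-hot inputs."""
    n = C * H * W
    columns = []
    for k in range(n):
        one_hot = [1.0 if idx == k else 0.0 for idx in range(n)]
        columns.append(flatten(conv2d(unflatten(one_hot, C, H, W), kernel, pad)))
    return [[columns[k][r] for k in range(n)] for r in range(len(columns[0]))]

def fc(matrix, bias, vec):
    """Plain fully-connected layer: matrix @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(matrix, bias)]

# Hypothetical toy sizes, chosen only so the demo runs quickly.
random.seed(0)
C, O, H, W, K = 2, 2, 4, 4, 3
pad, eps = K // 2, 1e-5
kernel = [[[[random.uniform(-1, 1) for _ in range(K)] for _ in range(K)]
           for _ in range(C)] for _ in range(O)]
gamma = [random.uniform(0.5, 1.5) for _ in range(O)]
beta = [random.uniform(-1, 1) for _ in range(O)]
mean = [random.uniform(-1, 1) for _ in range(O)]
var = [random.uniform(0.5, 2.0) for _ in range(O)]
x = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]

# Reference path: conv, then BN using its stored statistics.
y = conv2d(x, kernel, pad)
reference = [gamma[o] * (v - mean[o]) / (var[o] + eps) ** 0.5 + beta[o]
             for o in range(O) for row in y[o] for v in row]

# Merged path: a single FC layer with fused weight and bias.
fused_kernel, fused_bias = fuse_conv_bn(kernel, gamma, beta, mean, var, eps)
fc_weight = conv_to_fc(fused_kernel, pad, C, H, W)
fc_bias = [b for b in fused_bias for _ in range(H * W)]  # one bias per output position
merged = fc(fc_weight, fc_bias, flatten(x))

max_diff = max(abs(a - b) for a, b in zip(reference, merged))
print("max abs difference:", max_diff)
```

Since both paths are linear in the input, a matrix obtained this way could in principle be added to the weight of a parallel FC branch, which is the property the merge relies on. The grouping used in the actual block is omitted here for brevity.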

EXPERIMENTAL RESULTS

Research Questions

RQ1: Can an FC-based block with local priors capture both global dependencies and positional information for image tasks?
RQ2: Is it feasible to train with conv/BN branches and merge them into an FC without inference-time costs, while improving accuracy and speed?
RQ3: What is the impact of partitioning, grouping, and kernel choices on performance across image classification, face recognition, and segmentation?
RQ4: How does RepMLP compare to self-attention and other global-capacity modules in terms of speed and accuracy when deployed on standard benchmarks?

Key Findings

Pure MLP on CIFAR-10 with RepMLP achieves 91.11% accuracy with 52.8M FLOPs, approaching CNN performance under certain configurations.
Replacing convs with RepMLP in ResNet-50 on ImageNet (224x224) yields higher accuracy at lower FLOPs, though not higher throughput in the cited configuration: RepMLP-Res50 at 224 reaches 78.55% top-1 accuracy at 636 examples/s with 40.87M parameters, versus 77.19% accuracy, 689 examples/s, and 25.53M parameters for the vanilla ResNet-50.
On 320x320 inputs, RepMLP-Res50 variants achieve higher accuracy and throughput than ResNet-50/ResNet-101 baselines; e.g., RepMLP-Res50 with g8/16 achieves 79.76% top-1 at 312 examples/s, whereas a comparable ResNet-50/101 setup shows lower throughput.
Table comparisons indicate RepMLP variants can substantially reduce FLOPs for similar or improved accuracy relative to standard CNNs (e.g., ResNet-50 vs RepMLP-Res50 with 224 input).
Increasing the grouping and channel-reduction parameters (r, g) allows trade-offs between accuracy, speed, and parameter count, with certain configurations delivering faster speeds and competitive accuracy.
The architecture combines global capacity and positional perception of FCs with local priors from conv branches, offering advantages over non-local/self-attention modules in terms of simplicity and efficiency.
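
The parameter side of the grouping trade-off noted above can be sketched with simple arithmetic. The snippet below is an illustrative assumption, not code from the paper: the helper names are hypothetical, and the feature size (64 channels on a 7x7 partition, i.e. 3136 features) is chosen only as an example. It treats a groupwise FC as a block-diagonal matrix in which each group connects only its own slice of inputs to its own slice of outputs.

```python
def dense_fc_params(in_features, out_features):
    """Weight count of an ordinary FC layer (bias ignored)."""
    return in_features * out_features

def grouped_fc_params(in_features, out_features, groups):
    """Weight count of a groupwise FC: `groups` independent smaller FCs."""
    assert in_features % groups == 0 and out_features % groups == 0
    return (in_features // groups) * (out_features // groups) * groups

def grouped_fc(blocks, vec):
    """Apply a groupwise FC. blocks: one [out/g][in/g] matrix per group."""
    chunk = len(vec) // len(blocks)
    out = []
    for gi, block in enumerate(blocks):
        part = vec[gi * chunk:(gi + 1) * chunk]  # inputs visible to this group
        out.extend(sum(w * v for w, v in zip(row, part)) for row in block)
    return out

# Hypothetical example: 64 channels on a 7x7 partition -> 3136 features.
features = 64 * 7 * 7
print(dense_fc_params(features, features))       # 9834496
print(grouped_fc_params(features, features, 8))  # 1229312, i.e. 8x fewer weights

# Tiny functional check with 2 groups of 2 inputs and 2 outputs each.
print(grouped_fc([[[1, 2], [3, 4]], [[5, 6], [7, 8]]], [1, 1, 1, 1]))  # [3, 7, 11, 15]
```

With g groups the weight count drops by a factor of g, at the cost of each output seeing only its own group's inputs, which is one way to read the accuracy versus size trade-off reported for different g settings.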

This review was generated by AI and reviewed by human editors.