[Paper Review] Spatially-Adaptive Feature Modulation for Efficient Image Super-Resolution
SAFMN introduces a lightweight ViT-like block that combines spatially-adaptive feature modulation with a convolutional channel mixer, achieving competitive SR performance with significantly fewer parameters and lower memory use than many efficient SR methods.
Although numerous solutions have been proposed for image super-resolution, they are usually incompatible with low-power devices with many computational and memory constraints. In this paper, we address this problem by proposing a simple yet effective deep network to solve image super-resolution efficiently. In detail, we develop a spatially-adaptive feature modulation (SAFM) mechanism upon a vision transformer (ViT)-like block. Within it, we first apply the SAFM block over input features to dynamically select representative feature representations. As the SAFM block processes the input features from a long-range perspective, we further introduce a convolutional channel mixer (CCM) to simultaneously extract local contextual information and perform channel mixing. Extensive experimental results show that the proposed method is $3\times$ smaller than state-of-the-art efficient SR methods, e.g., IMDN, in terms of the network parameters and requires less computational cost while achieving comparable performance. The code is available at https://github.com/sunny2109/SAFMN.
Motivation & Objective
- Motivate efficient SR for low-power devices with limited compute and memory.
- Develop a lightweight network that leverages long-range feature interactions for SR.
- Introduce SAFM and CCM components to fuse global adaptability with local context.
- Demonstrate favorable accuracy–efficiency trade-offs against state-of-the-art lightweight SR models.
Proposed method
- Use a ViT-like architecture to enable long-range feature interactions through a multi-scale spatially-adaptive feature modulation (SAFM) block.
- Introduce a convolutional channel mixer (CCM) to encode local context and perform channel mixing efficiently.
- Stack feature mixing modules (FMMs) that combine SAFM and CCM with LayerNorm-based processing.
- Train with a combination of L1 loss and an FFT-based frequency loss to enhance high-frequency reconstruction.
- Utilize a light upsampler and a global residual connection to reconstruct HR images.
- Employ a feature pyramid with adaptive max pooling to generate multi-scale features for SAFM.
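The training objective listed above (L1 loss plus an FFT-based frequency loss) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `l1_fft_loss`, the `freq_weight` default of 0.05, and the choice to measure the frequency term as the mean complex magnitude of the FFT difference are all assumptions made for this example, since the review does not state them.

```python
import numpy as np


def l1_fft_loss(sr, hr, freq_weight=0.05):
    """Pixel-space L1 loss plus a weighted L1 distance between 2-D FFTs.

    sr, hr: arrays of shape (..., H, W) holding the super-resolved and
    ground-truth images. freq_weight is a placeholder value chosen for
    this sketch; it is not taken from the review.
    """
    sr = np.asarray(sr, dtype=np.float64)
    hr = np.asarray(hr, dtype=np.float64)
    if sr.shape != hr.shape:
        raise ValueError("sr and hr must have the same shape")

    # Pixel term: mean absolute error in the spatial domain.
    pixel_term = np.mean(np.abs(sr - hr))

    # Frequency term: 2-D FFT over the last two axes (height, width),
    # then the mean magnitude of the complex difference.
    sr_freq = np.fft.fft2(sr, axes=(-2, -1))
    hr_freq = np.fft.fft2(hr, axes=(-2, -1))
    freq_term = np.mean(np.abs(sr_freq - hr_freq))

    return pixel_term + freq_weight * freq_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.random((3, 16, 16))
    prediction = target + 0.1 * rng.standard_normal(target.shape)
    print("loss:", l1_fft_loss(prediction, target))
```

The frequency term penalizes mismatches in every Fourier coefficient, so errors concentrated in high-frequency detail (edges, textures) contribute to the loss even when the average pixel error is small. In a real training loop the same two terms would be written with a differentiable framework rather than NumPy.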
Experimental results
Research questions
- RQ1: Can a lightweight SAFM-based module achieve comparable SR performance to heavier models?
- RQ2: Does combining SAFM with a compact CCM provide an effective balance of accuracy and efficiency?
- RQ3: What is the impact of multi-scale representation and normalization choices on SR performance and stability?
- RQ4: How does SAFMN compare to state-of-the-art efficient SR models in terms of parameters, FLOPs, and memory usage?
Key findings
- SAFMN achieves competitive SR performance with significantly fewer parameters and memory usage than state-of-the-art efficient SR methods.
- On ×4 SR, SAFMN uses about 85% fewer parameters than CARN, 66% fewer than IMDN, and 42% fewer than ShuffleMixer, and has 60%, 29%, and 71% fewer activations than these three models, respectively.
- The multi-scale SAFM representation improves reconstruction by enabling long-range feature interactions with lower memory.
- The CCM effectively encodes local context and channel mixing with lower memory overhead than alternatives like inverted residual blocks.
- LayerNorm is essential for stable training and better performance compared to BN variants and other normalizations.
- Ablation shows SAFM and CCM components contribute cumulatively to performance gains over the baseline.
This review was created by AI and reviewed by human editors.