QUICK REVIEW

[Paper Review] Efficient Transformer for Single Image Super-Resolution

Zhisheng Lu, Hong Liu | arXiv (Cornell University) | Aug 25, 2021

Advanced Image Processing Techniques | References: 51 | Cited by: 30

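One-Sentence Summary

This paper proposes the Efficient Super-Resolution Transformer (ESRT), a hybrid CNN-Transformer architecture that pairs a lightweight CNN backbone (LCB) for feature extraction with a lightweight Transformer backbone (LTB) built on efficient multi-head attention (EMHA) to cut GPU memory usage. ESRT achieves competitive super-resolution results, and its efficient Transformer (ET) occupies 4,191MB of GPU memory versus 16,057MB for the original Transformer.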

ABSTRACT

Single image super-resolution task has witnessed great strides with the development of deep learning. However, most existing studies focus on building a more complex neural network with a massive number of layers, bringing heavy computational cost and memory storage. Recently, as Transformer yields brilliant results in NLP tasks, more and more researchers start to explore the application of Transformer in computer vision tasks. But with the heavy computational cost and high GPU memory occupation of the vision Transformer, the network can not be designed too deep. To address this problem, we propose a novel Efficient Super-Resolution Transformer (ESRT) for fast and accurate image super-resolution. ESRT is a hybrid Transformer where a CNN-based SR network is first designed in the front to extract deep features. Specifically, there are two backbones for formatting the ESRT: lightweight CNN backbone (LCB) and lightweight Transformer backbone (LTB). Among them, LCB is a lightweight SR network to extract deep SR features at a low computational cost by dynamically adjusting the size of the feature map. LTB is made up of an efficient Transformer (ET) with a small GPU memory occupation, which benefited from the novel efficient multi-head attention (EMHA). In EMHA, a feature split module (FSM) is proposed to split the long sequence into sub-segments and then these sub-segments are applied by attention operation. This module can significantly decrease the GPU memory occupation. Extensive experiments show that our ESRT achieves competitive results. Compared with the original Transformer which occupies 16057M GPU memory, the proposed ET only occupies 4191M GPU memory with better performance.

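Research Motivation and Objectives

Address the high computational and memory cost of deep Transformer models in single image super-resolution (SISR).
Reduce the GPU memory consumption of vision Transformers without sacrificing performance.
Design a lightweight, efficient architecture suitable for deploying deep neural networks in SISR.
Enable deeper network designs by minimizing the memory overhead of the self-attention mechanism.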

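Proposed Method

Integrate a lightweight CNN backbone (LCB) that extracts deep features efficiently by dynamically adjusting the feature map size.
Introduce a lightweight Transformer backbone (LTB) based on the efficient multi-head attention (EMHA) mechanism.
Design a feature split module (FSM) inside EMHA that divides long feature sequences into sub-segments to reduce memory usage.
Apply self-attention only within each sub-segment, lowering compute and memory requirements while preserving performance.
Combine LCB and LTB into a hybrid architecture that balances the efficiency of CNNs with the long-range modeling ability of Transformers.
Optimize the network for low inference cost and high-quality image reconstruction.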
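
The segment-wise attention described above can be illustrated with a minimal, dependency-free Python sketch. It assumes a single head, no learned projections, and equal-length contiguous segments; the function names (`attention`, `split_attention`, `score_entries`) and the `num_segments` parameter are hypothetical and not taken from the paper's code. It is not the authors' EMHA implementation, only a way to see why splitting shrinks the attention score matrix.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention on lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        # One row of the score matrix: similarity of this query to every key.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(dim_v)])
    return out


def split_attention(q, k, v, num_segments):
    """Attend within each contiguous segment, then concatenate the results.

    Assumption: equal-length segments, mirroring the idea of a feature split.
    """
    n = len(q)
    if n % num_segments != 0:
        raise ValueError("sequence length must be divisible by num_segments")
    seg = n // num_segments
    out = []
    for start in range(0, n, seg):
        end = start + seg
        out.extend(attention(q[start:end], k[start:end], v[start:end]))
    return out


def score_entries(n, num_segments):
    """Total attention scores computed: num_segments blocks of (n/s) x (n/s)."""
    seg = n // num_segments
    return num_segments * seg * seg


# Toy input: 8 tokens with 4 features each.
tokens = [[float(i + j) for j in range(4)] for i in range(8)]
out = split_attention(tokens, tokens, tokens, num_segments=4)
print(len(out), len(out[0]))                      # 8 4
print(score_entries(8, 1), score_entries(8, 4))   # 64 16
```

Under these assumptions, splitting a sequence of N tokens into s segments cuts the total number of attention scores from N×N to N×N/s, at the cost of each token attending only to tokens in its own segment.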

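Experimental Results

Research Questions

RQ1: Can a hybrid CNN-Transformer architecture reduce GPU memory usage in SISR while maintaining high performance?
RQ2: How much does the feature split module (FSM) reduce memory consumption during self-attention computation?
RQ3: Can the proposed efficient multi-head attention (EMHA) support deeper Transformer networks for SISR?
RQ4: Can the lightweight CNN backbone (LCB) preserve feature extraction quality at a lower computational cost?
RQ5: What is the trade-off between model depth, memory usage, and reconstruction accuracy when vision Transformers are applied to SISR?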

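Key Findings

The proposed efficient Transformer (ET) occupies 4,191MB of GPU memory versus 16,057MB for the original Transformer, a reduction of about 74%.
The EMHA-based lightweight Transformer backbone (LTB) maintains competitive performance while substantially lowering memory usage.
The feature split module (FSM) splits long sequences into sub-segments, enabling efficient attention computation with less memory.
The hybrid LCB and LTB design achieves high-quality image reconstruction at a lower computational cost than a standard Transformer.
ESRT achieves competitive performance on single image super-resolution while being more memory-efficient.
The model is described as generalizing well and, given its efficiency, as suited to deployment on resource-constrained devices.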
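
The reduction figure can be checked directly from the two memory numbers quoted in the abstract. The short Python calculation below is only arithmetic on those reported values; the variable and function names are illustrative.

```python
def relative_reduction(before, after):
    """Fraction of the original amount that was saved."""
    return (before - after) / before


baseline_mb = 16057  # original Transformer, as reported in the abstract
et_mb = 4191         # efficient Transformer (ET), as reported in the abstract

saved_mb = baseline_mb - et_mb
percent_saved = 100 * relative_reduction(baseline_mb, et_mb)
shrink_factor = baseline_mb / et_mb

print(saved_mb)                 # 11866
print(round(percent_saved, 1))  # 73.9
print(round(shrink_factor, 2))  # 3.83
```

The saving works out to roughly 73.9%, which rounds to the 74% stated above, or about a 3.8x smaller memory footprint.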

This review was generated by AI and reviewed by human editors.