QUICK REVIEW

[Paper Review] Efficient Transformer for Single Image Super-Resolution

Zhisheng Lu, Hong Liu|arXiv (Cornell University)|2021. 08. 25.

Advanced Image Processing Techniques | References: 51 | Citations: 30

One-Line Summary

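This paper proposes the Efficient Super-Resolution Transformer (ESRT), a hybrid CNN-Transformer architecture that combines a lightweight CNN backbone (LCB) for feature extraction with a lightweight Transformer backbone (LTB) built on efficient multi-head attention (EMHA). The efficient Transformer lowers GPU memory usage from 16,057 MB for the original Transformer to 4,191 MB while still achieving competitive super-resolution performance.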

ABSTRACT

Single image super-resolution task has witnessed great strides with the development of deep learning. However, most existing studies focus on building a more complex neural network with a massive number of layers, bringing heavy computational cost and memory storage. Recently, as Transformer yields brilliant results in NLP tasks, more and more researchers start to explore the application of Transformer in computer vision tasks. But with the heavy computational cost and high GPU memory occupation of the vision Transformer, the network can not be designed too deep. To address this problem, we propose a novel Efficient Super-Resolution Transformer (ESRT) for fast and accurate image super-resolution. ESRT is a hybrid Transformer where a CNN-based SR network is first designed in the front to extract deep features. Specifically, there are two backbones for formatting the ESRT: lightweight CNN backbone (LCB) and lightweight Transformer backbone (LTB). Among them, LCB is a lightweight SR network to extract deep SR features at a low computational cost by dynamically adjusting the size of the feature map. LTB is made up of an efficient Transformer (ET) with a small GPU memory occupation, which benefited from the novel efficient multi-head attention (EMHA). In EMHA, a feature split module (FSM) is proposed to split the long sequence into sub-segments and then these sub-segments are applied by attention operation. This module can significantly decrease the GPU memory occupation. Extensive experiments show that our ESRT achieves competitive results. Compared with the original Transformer which occupies 16057M GPU memory, the proposed ET only occupies 4191M GPU memory with better performance.

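Research Motivation and Objectives

Address the high computational and memory cost of deep Transformer models in single image super-resolution (SISR).
Reduce the GPU memory consumption of vision Transformers without sacrificing performance.
Design a lightweight, efficient architecture suited to deploying deep networks for SISR.
Minimize the memory overhead of the self-attention mechanism so that deeper networks become feasible.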

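Proposed Method

Use a lightweight CNN backbone (LCB) that dynamically adjusts the feature-map size to extract deep features efficiently.
Build a lightweight Transformer backbone (LTB) on an efficient multi-head attention (EMHA) mechanism.
Add a feature split module (FSM) inside EMHA that divides a long feature sequence into sub-segments to reduce memory usage.
Apply self-attention only within each sub-segment, lowering compute and memory requirements while maintaining performance.
Combine LCB and LTB into a hybrid architecture that exploits both the efficiency of CNNs and the long-range modeling ability of Transformers.
Target low-cost inference together with high-quality image reconstruction.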
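
The segment-wise attention idea behind the FSM can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single head, no learned projections, and a sequence length divisible by the number of segments, and the names `attention`, `split_attention`, `score_entries`, and `num_segments` are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for token lists q, k, v (each n x d)."""
    d = len(q[0])
    out = []
    for qi in q:
        # One row of the n x n score matrix: qi against every key.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def split_attention(q, k, v, num_segments):
    """Attend within each of num_segments equal slices, then concatenate.

    Hypothetical helper: assumes len(q) is divisible by num_segments.
    """
    n = len(q)
    if n % num_segments != 0:
        raise ValueError("sequence length must be divisible by num_segments")
    seg = n // num_segments
    out = []
    for start in range(0, n, seg):
        end = start + seg
        out.extend(attention(q[start:end], k[start:end], v[start:end]))
    return out


def score_entries(n, num_segments):
    """Attention scores materialized in total: s * (n / s)^2 = n^2 / s."""
    seg = n // num_segments
    return num_segments * seg * seg
```

For a sequence of 8 tokens, `score_entries(8, 1)` gives 64 while `score_entries(8, 4)` gives 16, so under these assumptions splitting into s segments shrinks the score matrix by a factor of s. The trade-off is that tokens in different segments no longer attend to each other within that layer.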

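Experimental Results

Research Questions

RQ1: Can a hybrid CNN-Transformer architecture reduce GPU memory usage while maintaining high performance on SISR?
RQ2: How effectively does the feature split module (FSM) reduce memory consumption during self-attention computation?
RQ3: Does the proposed efficient multi-head attention (EMHA) enable deeper Transformer networks for SISR?
RQ4: Can the lightweight CNN backbone (LCB) preserve feature-extraction quality at a low computational cost?
RQ5: What are the trade-offs among model depth, memory usage, and reconstruction accuracy for vision Transformers in SISR?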

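Key Findings

The proposed efficient Transformer (ET) cuts GPU memory usage by about 74%, from 16,057 MB for the original Transformer to 4,191 MB.
The EMHA-based lightweight Transformer backbone (LTB) maintains competitive performance while substantially lowering memory usage.
The feature split module (FSM) divides long sequences into sub-segments, enabling efficient attention computation with reduced memory consumption.
The hybrid LCB and LTB design delivers high-quality image reconstruction at a lower computational cost than a standard Transformer.
ESRT achieves competitive single image super-resolution performance with improved efficiency and scalability.
The model is described as generalizing well and being efficient, which makes it a candidate for deployment on resource-constrained devices.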
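
The reported reduction can be checked with a few lines of arithmetic. The snippet below is only a sanity check on the two memory figures quoted from the abstract; `reduction_percent` is a hypothetical helper name.

```python
def reduction_percent(before_mb, after_mb):
    """Relative saving, in percent, when going from before_mb to after_mb."""
    return 100.0 * (before_mb - after_mb) / before_mb


original_mb = 16057   # original Transformer, as reported in the abstract
efficient_mb = 4191   # proposed efficient Transformer (ET)

saving = round(reduction_percent(original_mb, efficient_mb), 1)
ratio = round(original_mb / efficient_mb, 2)
print(saving, ratio)  # 73.9 3.83
```

So the "about 74%" figure corresponds to a roughly 3.8x smaller memory footprint.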

This review was generated by AI and reviewed by a human editor.