QUICK REVIEW

[Paper Review] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

Enze Xie, Wenhai Wang | arXiv (Cornell University) | May 31, 2021

Advanced Image and Video Retrieval Techniques | Citations: 3,230

One-Line Summary

SegFormer pairs a hierarchical Transformer encoder that needs no positional encoding with a lightweight All-MLP decoder, reaching strong accuracy at high efficiency across ADE20K, Cityscapes, and COCO-Stuff.

ABSTRACT

We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, thus combining both local attention and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation on Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5x smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C. Code will be released at: github.com/NVlabs/SegFormer.

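Motivation and Objectives

Motivate a semantic segmentation framework that pairs a Transformer encoder with a lightweight decoder for both efficiency and robustness.
Develop a hierarchical Transformer encoder, free of positional encoding, that outputs multi-scale features suited to dense prediction.
Design a compact All-MLP decoder that fuses multi-level features without a heavy backbone or complex modules.
Demonstrate state-of-the-art performance and robustness on ADE20K, Cityscapes, and COCO-Stuff with the scalable MiT encoder.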

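Proposed Method

Introduce the MiT (Mix Transformer) encoder, which produces hierarchical feature maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution.
Use overlapping patch merging to build multi-scale features without losing spatial locality.
Adopt efficient self-attention with sequence reduction, cutting complexity from O(N^2) to O(N^2/R), where N is the sequence length and R is the reduction ratio.
Replace the standard ViT-style FFN with Mix-FFN, which combines a 3x3 depthwise convolution with an MLP to inject positional information without fixed positional embeddings.
Generate the segmentation mask with a lightweight All-MLP decoder that upsamples and fuses multi-level features using only simple linear layers and MLPs.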
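
The feature-map sizes, attention cost, and decoder shapes described in this section can be traced in plain Python. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the patch-merging settings (kernel, stride, padding), the per-stage reduction ratios R, the 256-channel decoder width, and the 150-class output are assumptions taken from the defaults reported in the SegFormer paper, not from this review.

```python
# Hypothetical helpers for illustration only. All hyperparameters below are
# assumptions based on the defaults reported in the SegFormer paper.

# Overlapping patch merging per stage: (kernel K, stride S, padding P).
PATCH_MERGE = [(7, 4, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1)]
# Sequence reduction ratio R per stage for efficient self-attention.
REDUCTION = [64, 16, 4, 1]


def conv_out(size, kernel, stride, padding):
    """Spatial output size of a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_shapes(height, width):
    """Return the (H_i, W_i) feature-map size after each of the four stages."""
    shapes = []
    h, w = height, width
    for kernel, stride, padding in PATCH_MERGE:
        h = conv_out(h, kernel, stride, padding)
        w = conv_out(w, kernel, stride, padding)
        shapes.append((h, w))
    return shapes


def attention_pairs(h, w, reduction):
    """Count query-key pairs for full vs. sequence-reduced self-attention."""
    n = h * w
    full = n * n                    # O(N^2)
    reduced = n * (n // reduction)  # O(N^2 / R): keys/values shortened to N/R
    return full, reduced


def decoder_shapes(stage_hw, embed_dim=256, num_classes=150):
    """Trace (channels, H, W) through the four All-MLP decoder steps."""
    h4, w4 = stage_hw[0]  # the 1/4-resolution grid is the common target size
    return {
        "unify": [(embed_dim, h, w) for h, w in stage_hw],    # linear: C_i -> C
        "upsample": [(embed_dim, h4, w4) for _ in stage_hw],  # all to 1/4 scale
        "fuse": (embed_dim, h4, w4),       # concat gives 4C channels, linear: 4C -> C
        "predict": (num_classes, h4, w4),  # linear: C -> number of classes
    }


if __name__ == "__main__":
    shapes = stage_shapes(512, 512)
    for (h, w), r in zip(shapes, REDUCTION):
        full, reduced = attention_pairs(h, w, r)
        print(f"{h}x{w}  R={r:<2}  full={full}  reduced={reduced}")
    print(decoder_shapes(shapes)["predict"])
```

Under these assumptions, a 512x512 input yields stage sizes of 128, 64, 32, and 16 pixels per side, matching the 1/4 to 1/32 hierarchy, and the reduction ratio shrinks the attention cost most in the early, high-resolution stages.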

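Experimental Results

Research Questions

RQ1: Can a hierarchical Transformer encoder without positional encoding produce high-resolution, multi-scale features suitable for semantic segmentation?
RQ2: Can a lightweight All-MLP decoder fuse multi-level Transformer features well enough to give accurate pixel-wise predictions?
RQ3: How does the SegFormer family scale in accuracy, parameters, FLOPs, and speed on standard semantic segmentation benchmarks?
RQ4: Are the proposed Mix-FFN and overlapping patch merging robust to changes in test resolution and across different datasets?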

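Key Results

SegFormer-B0 delivers strong real-time performance on ADE20K with 3.8M parameters and 8.4G FLOPs, outperforming real-time counterparts on several metrics.
SegFormer-B5 reaches 84.0% mIoU on the Cityscapes validation set while being much smaller and faster than the previous best methods.
On ADE20K, SegFormer-B4 reaches 50.3% mIoU with 64M parameters, surpassing the previous best method while being about 5x smaller.
With a model much smaller than SETR, SegFormer sets a new state of the art of 51.8% mIoU on ADE20K and reaches 83.8-84.0% on Cityscapes.
SegFormer is robust to natural corruptions (Cityscapes-C), substantially outperforming previous methods across multiple corruption categories.
On COCO-Stuff, SegFormer-B5 reaches 46.7% mIoU with 84.7M parameters, about 0.9 percentage points higher than the comparable SETR while also being smaller.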
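
All of the accuracy figures above are reported as mIoU (mean intersection over union). As a reminder of what that metric measures, here is a small plain-Python sketch. It is a simplification under stated assumptions: the function name is hypothetical, labels are flat lists of class indices, and there is no ignore label, whereas benchmark toolkits typically accumulate counts over the whole dataset and skip ignored pixels.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in the prediction or the target.

    Simplified illustration: `pred` and `target` are equal-length sequences of
    integer class indices, and no ignore label is handled.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Correct pixel: counts once toward both intersection and union.
            intersection[p] += 1
            union[p] += 1
        else:
            # Wrong pixel: enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    # Per-class IoU is 1/3, 2/3, and 1/2, so the mean is 0.5.
    print(mean_iou(pred, target, num_classes=3))
```

Because each class contributes equally to the mean, a model has to segment rare classes well, not only frequent ones, to score highly.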

This review was generated by AI and reviewed by a human editor.