QUICK REVIEW

[Paper Review] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

Enze Xie, Wenhai Wang|CaltechAUTHORS (California Institute of Technology)|2021. 05. 31.

Advanced Neural Network Applications|References 71|Citations 848

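One-Line Summary

SegFormer pairs a hierarchical Transformer encoder that uses no positional encoding with a lightweight All-MLP decoder, achieving strong efficiency and accuracy in semantic segmentation across ADE20K, Cityscapes, and COCO-Stuff.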

ABSTRACT

We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, thus combining both local attention and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation on Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5x smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C. Code will be released at: github.com/NVlabs/SegFormer.

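Motivation and Objectives

Build a lightweight, accurate semantic segmentation framework that works across varying input resolutions.
Develop a hierarchical Transformer encoder that outputs multi-scale features without positional encoding.
Propose a lightweight All-MLP decoder that fuses multi-level features for semantic segmentation.
Demonstrate improved efficiency (parameter count, FLOPs, speed) and robustness on standard datasets.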

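Proposed Method

Introduces the MiT (Mix Transformer) encoder, which outputs hierarchical features at 1/4, 1/8, 1/16, and 1/32 of the input resolution.
Removes positional embeddings from the encoder to avoid the resolution-dependent interpolation problem.
Uses efficient self-attention with sequence reduction, which lowers complexity from O(N^2) to O(N^2/R), where N is the sequence length and R is the reduction ratio.
Replaces conventional CNN/Transformer decoders with a lightweight All-MLP decoder that fuses multi-level features using simple MLP layers.
Introduces Mix-FFN, which adds a 3x3 depthwise convolution to the MLP in the feed-forward network (FFN), to supply positional information without fixed positional encoding.
Provides a family of MiT models from B0 to B5 to trade off accuracy against efficiency.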
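The stage resolutions and the O(N^2) to O(N^2/R) reduction listed above can be made concrete with a little arithmetic. The following is a minimal sketch, not the authors' code: it counts only pairwise query-key scores (ignoring the channel dimension), and the per-stage reduction ratios in `REDUCTION_RATIOS` are an assumption for illustration, since this review does not state them. The function name `stage_stats` is hypothetical.

```python
# Sketch: per-stage token counts and self-attention cost with sequence reduction.
# STRIDES follows the 1/4, 1/8, 1/16, 1/32 feature resolutions listed above.
# REDUCTION_RATIOS is an ASSUMPTION for illustration; the review does not give R.

STRIDES = (4, 8, 16, 32)
REDUCTION_RATIOS = (64, 16, 4, 1)  # assumed reduction ratio R per stage


def stage_stats(height, width, strides=STRIDES, ratios=REDUCTION_RATIOS):
    """Return one dict per encoder stage with its token count and attention cost."""
    stats = []
    for stride, r in zip(strides, ratios):
        h, w = height // stride, width // stride
        n = h * w                    # sequence length N at this stage
        full_cost = n * n            # O(N^2): each query scores all N keys
        reduced_cost = n * (n // r)  # O(N^2 / R): keys are shortened to N / R
        stats.append({
            "stride": stride,
            "feature_hw": (h, w),
            "tokens": n,
            "full_cost": full_cost,
            "reduced_cost": reduced_cost,
        })
    return stats


if __name__ == "__main__":
    for stage in stage_stats(512, 512):
        print(stage)
```

Under these assumptions, a 512x512 input gives 16,384 tokens at the first stage, so full attention would need about 268 million pairwise scores while the reduced version needs about 4.2 million. At the last stage (R = 1) the two costs coincide.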

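Experimental Results

Research Questions

RQ1: Can a hierarchical Transformer encoder without positional encoding produce multi-scale features suitable for segmentation while staying robust to changes in test resolution?
RQ2: Can a lightweight All-MLP decoder effectively fuse Transformer features from different levels, yielding strong segmentation performance at lower computational cost?
RQ3: How do model size (B0 to B5) and decoder channel dimension (C) affect accuracy, FLOPs, and latency on standard segmentation benchmarks?
RQ4: Is Mix-FFN a viable alternative to fixed positional embeddings in terms of robustness to resolution changes at test time?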

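Key Results

SegFormer-B0 has 3.8M parameters and 8.4 GFLOPs, runs at near real-time speed, and delivers strong mIoU across datasets.
SegFormer-B5 achieves 84.0% mIoU on the Cityscapes validation set and 51.8% mIoU on ADE20K, while being far more efficient than prior methods such as SETR.
On ADE20K, SegFormer-B4 achieves 50.3% mIoU with 64M parameters, surpassing prior methods in both accuracy and efficiency.
SegFormer is highly robust on Cityscapes-C, outperforming previous methods under many corruption scenarios (e.g., a relative improvement of up to 588% on Gaussian noise).
The Mix-FFN encoder (no positional encoding) is more robust to changes in test resolution than an encoder with fixed positional embeddings.
SegFormer's All-MLP decoder leverages Transformer-derived features to achieve a larger effective receptive field without heavy modules.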
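The results above are all reported in mIoU (mean intersection over union). As a reminder of what that number measures, here is a minimal sketch that computes it from flat lists of per-pixel class labels. It is not the benchmarks' official evaluation code: the function name `mean_iou` is hypothetical, the ignore label 255 is an assumed convention, and real benchmarks typically accumulate intersections and unions over the whole dataset rather than one image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that appear in pred or target.

    pred, target: equal-length sequences of integer class labels in
    [0, num_classes). Pixels whose target equals ignore_index (an assumed
    convention) are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Correct pixel: counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 255]
    print(mean_iou(pred, target, num_classes=3))
```

In this toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18 (about 0.722). Because the score averages over classes rather than pixels, rare classes weigh as much as common ones.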

This review was generated by AI and checked by a human editor.