QUICK REVIEW

[Paper Review] SegViT: Semantic Segmentation with Plain Vision Transformers

Bowen Zhang, Zhi Tian | arXiv (Cornell University) | 2022-10-12

Advanced Neural Network Applications | Citations: 75

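One-Line Summary

SegViT introduces an Attention-to-Mask (ATM) decoder for semantic segmentation with plain Vision Transformers, and its Shrunk backbone design cuts computation while achieving performance that is competitive with or close to the state of the art.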

ABSTRACT

We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose the SegVit. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. Differently, we make use of the fundamental component -- attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegVit using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to $40\%$ computations while maintaining competitive performance.

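Motivation and Objectives

Explore the potential of plain Vision Transformers (ViTs) for semantic segmentation.
Propose the Attention-to-Mask (ATM) module, which derives masks from attention maps.
Cascade ATM modules across multiple ViT layers so that segmentation fuses multi-layer information.
Introduce a Shrunk backbone (query-based down-sampling and up-sampling) to reduce computation.
Show state-of-the-art or competitive results on ADE20K, COCO-Stuff-10K, and PASCAL-Context.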

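Proposed Method

Define class-token queries and apply cross-attention between them and the backbone feature map; the sigmoid of the resulting similarity map gives one mask per class.
Apply a linear transformation followed by softmax to the updated class tokens to compute the class predictions.
Fuse the ATM outputs from several ViT layers to form the final segmentation prediction.
Introduce the Shrunk structure, which uses query-based down-sampling (QD) and query-based up-sampling (QU) to reduce GFLOPs by up to about 40%.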
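Train with a combination of loss terms, $L_{\text{overall}} = L_{\text{cls}} + \lambda_{\text{focal}} L_{\text{focal}} + \lambda_{\text{dice}} L_{\text{dice}}$, supervising both the class tokens and the masks across layers.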
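The ATM steps listed above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration and not the authors' implementation. The function names (`atm_layer`, `decode_segmentation`), the residual token update, the shared classifier weight `w_cls`, and the choice to fuse layers by summing per-class scores are all assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def atm_layer(class_tokens, features, w_cls):
    """One hypothetical Attention-to-Mask step.

    class_tokens: (N, C) class queries, one per class
    features:     (L, C) flattened spatial features from one ViT layer
    w_cls:        (C, K) linear classifier weights (assumed, K = N here)
    """
    channels = class_tokens.shape[1]
    # Similarity map between every class token and every spatial position.
    sim = class_tokens @ features.T / np.sqrt(channels)   # (N, L)
    # Sigmoid of the similarity map is used directly as the per-class mask.
    masks = sigmoid(sim)                                   # (N, L)
    # Standard cross-attention update of the tokens (residual form assumed).
    updated = class_tokens + softmax(sim, axis=-1) @ features  # (N, C)
    # Linear transformation followed by softmax gives class predictions.
    class_probs = softmax(updated @ w_cls, axis=-1)        # (N, K)
    return masks, class_probs, updated


def decode_segmentation(class_tokens, layer_features, w_cls, hw):
    """Cascade ATM over several layers and fuse by summing scores (assumed)."""
    tokens = class_tokens
    scores = 0.0
    for feats in layer_features:
        masks, probs, tokens = atm_layer(tokens, feats, w_cls)
        # Weight each mask by its class probabilities: (K, N) @ (N, L) -> (K, L).
        scores = scores + probs.T @ masks
    return scores.argmax(axis=0).reshape(hw)


rng = np.random.default_rng(0)
num_classes, channels, height, width = 4, 8, 3, 5
tokens = rng.normal(size=(num_classes, channels))
w_cls = rng.normal(size=(channels, num_classes))
layer_feats = [rng.normal(size=(height * width, channels)) for _ in range(3)]
label_map = decode_segmentation(tokens, layer_feats, w_cls, (height, width))
print(label_map.shape)  # (3, 5)
```

The point of the sketch is that the mask comes from the attention similarity itself rather than from a separate pixel-level decoder; a real model would use learned multi-head projections and the losses described above.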

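Experimental Results

Research Questions

RQ1: Can a plain ViT backbone be used effectively for dense semantic segmentation through attention-based mask inference?
RQ2: Do masks obtained from cross-attention similarity maps improve segmentation quality over pixel-wise decoding of ViT features?
RQ3: Can the multi-layer ATM cascade and the Shrunk backbone reduce computation for ViT-based segmentation while maintaining accuracy?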

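Key Results

With ATM, SegViT reaches 55.2% mIoU on ADE20K with a ViT-Large backbone and 55.1% with the Shrunk variant, remaining competitive at reduced cost.
On ADE20K, SegViT with ViT-Large outperforms several ViT-based methods and, in certain settings, exceeds or approaches the state of the art.
SegViT-Shrunk cuts computational cost by about 40% (373.5 GFLOPs versus 637.9 GFLOPs) with only a minor loss in performance.
Feeding multiple layers into ATM yields consistent mIoU gains (for example, up to +1.7% on ADE20K when three layers are used).
SegViT shows strong results on PASCAL-Context (65.3% mIoU with 60 classes) and COCO-Stuff-10K (50.3% mIoU with ViT-Large).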
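As a quick sanity check on the "about 40%" figure, the relative saving can be recomputed from the two GFLOPs values and the two ADE20K mIoU values quoted above. The helper name `relative_reduction` is introduced only for this illustration.

```python
def relative_reduction(baseline, reduced):
    """Fraction of the baseline cost that is saved."""
    return (baseline - reduced) / baseline


full_gflops = 637.9     # SegViT with ViT-Large, as quoted above
shrunk_gflops = 373.5   # SegViT-Shrunk, as quoted above

saving = relative_reduction(full_gflops, shrunk_gflops)
miou_drop = round(55.2 - 55.1, 1)  # ADE20K mIoU, in percentage points

print(f"GFLOPs saved: {saving:.1%}")      # GFLOPs saved: 41.4%
print(f"mIoU drop: {miou_drop} points")   # mIoU drop: 0.1 points
```

So the quoted numbers correspond to a saving of roughly 41% of the computation for a 0.1-point mIoU drop on ADE20K.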

This review was generated by AI and reviewed by a human editor.