QUICK REVIEW

[Paper Review] Semantic Segmentation using Vision Transformers: A survey

Hans Thisanke, Chamli Deshan | arXiv (Cornell University) | 2023. 05. 05.

Advanced Neural Network Applications | Citations: 15

One-Line Summary

This survey reviews Vision Transformer (ViT) architectures for semantic segmentation, compares SETR, Swin Transformer, Segmenter, SegFormer, and PVT on benchmark datasets such as ADE20K and Cityscapes, and discusses data strategies and loss functions.

ABSTRACT

Semantic segmentation has a broad range of applications in a variety of domains including land coverage analysis, autonomous driving, and medical image analysis. Convolutional neural networks (CNN) and Vision Transformers (ViTs) provide the architecture models for semantic segmentation. Even though ViTs have proven success in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection since ViT is not a general purpose backbone due to its patch partitioning scheme. In this survey, we discuss some of the different ViT architectures that can be used for semantic segmentation and how their evolution managed the above-stated challenge. The rise of ViT and its performance with a high success rate motivated the community to slowly replace the traditional convolutional neural networks in various computer vision tasks. This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets. This will be worthwhile for the community to yield knowledge regarding the implementations carried out in semantic segmentation and to discover more efficient methodologies using ViTs.

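Research Motivation and Goals

Evaluate how ViT-based architectures address the dense prediction problem in semantic segmentation.
Compare architecture types (pure ViT versus hybrid) and decoder heads in terms of segmentation accuracy and efficiency.
Identify data strategies (transfer learning, self-supervised learning) that make ViTs usable when labeled data is limited.
Summarize the commonly used loss functions and benchmarks to guide future research on ViT-based segmentation.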

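Proposed Method

Presents a taxonomy of ViT-based segmentation architectures (e.g., SETR, Swin Transformer, Segmenter, SegFormer, PVT).
Discusses architectural adaptations that reduce computation, including hierarchical backbones, patch merging, and efficient self-attention.
Highlights benchmark results and the datasets they are reported on (ADE20K, Cityscapes, PASCAL-Context, and others).
Describes practical data strategies for ViTs in segmentation, including self-supervised learning and transfer learning.
Reviews loss functions (cross-entropy, weighted cross-entropy, focal loss, Dice/IoU losses) and their effect on segmentation accuracy.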
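
The loss functions listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the survey's implementation: pixels are flattened to shape (N, K) for K classes, the focal loss omits the optional per-class alpha term, the weighted cross-entropy is normalized by the sum of the weights (one common convention), and all function names are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_entropy(probs, labels, class_weights=None, eps=1e-12):
    """Mean (optionally class-weighted) cross-entropy.

    probs: (N, K) class probabilities per pixel; labels: (N,) int class ids.
    """
    picked = probs[np.arange(labels.size), labels]  # probability of the true class
    losses = -np.log(picked + eps)
    if class_weights is None:
        return losses.mean()
    w = class_weights[labels]
    # Assumption: normalize by the total weight of the pixels in the batch.
    return (w * losses).sum() / w.sum()

def focal_loss(probs, labels, gamma=2.0, eps=1e-12):
    """Focal loss: down-weights pixels that are already classified well."""
    picked = probs[np.arange(labels.size), labels]
    return (-((1.0 - picked) ** gamma) * np.log(picked + eps)).mean()

def dice_loss(probs, labels, eps=1e-6):
    """Soft Dice loss averaged over classes."""
    onehot = np.eye(probs.shape[1])[labels]          # (N, K)
    intersection = (probs * onehot).sum(axis=0)      # per-class overlap
    denominator = probs.sum(axis=0) + onehot.sum(axis=0)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice.mean()

# Toy batch: 6 pixels, 3 classes.
rng = np.random.default_rng(0)
logits = rng.standard_normal((6, 3))
labels = np.array([0, 1, 2, 1, 0, 2])
probs = softmax(logits)
weights = np.array([1.0, 2.0, 0.5])  # hypothetical class weights

print("CE      :", cross_entropy(probs, labels))
print("weighted:", cross_entropy(probs, labels, weights))
print("focal   :", focal_loss(probs, labels))
print("dice    :", dice_loss(probs, labels))
```

With gamma set to 0 the focal loss reduces to plain cross-entropy, which is a quick sanity check on any implementation. Dice-style losses operate on region overlap rather than per-pixel likelihood, which is why they are often considered when classes are imbalanced.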

Figure 1: Architecture of the Vision Transformer. The model splits an image into a number of fixed-size patches and linearly embeds them with position embeddings (left). Then the result is fed into a standard transformer encoder (right). Adapted from [ 2 ] .
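
The patch partitioning and linear embedding described in the Figure 1 caption can be sketched as follows. This is a minimal NumPy illustration, assuming a channels-last image, random (untrained) projection and position-embedding matrices, and no class token; the names `patchify` and `embed_patches` are hypothetical.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)       # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, patch_size, proj, pos_embed):
    """Linearly project each patch and add a position embedding."""
    tokens = patchify(image, patch_size) @ proj      # (num_patches, dim)
    return tokens + pos_embed

# Toy setup: 32x32 RGB image, 8x8 patches, 16-dimensional tokens.
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
patch_size, dim = 8, 16
num_patches = (32 // patch_size) ** 2
proj = rng.standard_normal((patch_size * patch_size * 3, dim)) * 0.02
pos_embed = rng.standard_normal((num_patches, dim)) * 0.02

tokens = embed_patches(image, patch_size, proj, pos_embed)
print(tokens.shape)  # (16, 16): 16 patch tokens, each of dimension 16
```

The resulting token sequence is what the transformer encoder consumes. Because every token covers a fixed patch at a single scale, the output has much lower spatial resolution than the input, which is the limitation for dense prediction that the abstract attributes to the patch partitioning scheme.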

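Experimental Results

Research Questions

RQ1: Which ViT-based architectures have been proposed for semantic segmentation, and how do they perform on standard datasets?
RQ2: How do design choices (backbone type, decoder design, patch size) affect segmentation accuracy and efficiency?
RQ3: Which data strategy (supervised learning, self-supervised learning, transfer learning) best mitigates the data-hungry nature of ViTs in segmentation tasks?
RQ4: Which loss functions are most effective for pixel-wise segmentation with ViTs across datasets?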

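Key Results

Swin Transformer achieves strong results with a hierarchical design and an attention mechanism of linear complexity; the cited work reports 53.5% mIoU on the ADE20K validation set.
Segmenter combines a ViT backbone with a mask transformer decoder to exploit global context, improving segmentation over CNN-based methods.
SegFormer pairs a hierarchical encoder that needs no positional encoding with a lightweight MLP decoder, delivering competitive results and robustness; it comes in variants from B0 to B5.
SETR introduces a pure Transformer encoder for segmentation, with variants such as SETR-PUP and SETR-MLA reporting results on ADE20K and Pascal Context.
PVT provides a progressive pyramid backbone that balances resolution and computation, improving efficiency on dense prediction tasks.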
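
The results above are reported in mean Intersection over Union (mIoU). A minimal sketch of how that metric is typically computed from a confusion matrix follows, assuming integer label maps, an ignore label of 255 (a common convention, not something the survey specifies), and classes with an empty union being skipped in the average; the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Accumulate a (num_classes, num_classes) matrix indexed [target, pred]."""
    pred = np.asarray(pred, dtype=np.int64).ravel()
    target = np.asarray(target, dtype=np.int64).ravel()
    keep = target != ignore_index                    # drop unlabeled pixels
    index = target[keep] * num_classes + pred[keep]
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(conf):
    """Average per-class IoU over classes that appear in pred or target."""
    intersection = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    valid = union > 0
    return (intersection[valid] / union[valid]).mean()

# Toy 2x3 label maps with 3 classes; one pixel is marked as ignored.
target = np.array([[0, 0, 1],
                   [1, 2, 255]])
pred = np.array([[0, 1, 1],
                 [1, 2, 2]])

conf = confusion_matrix(pred, target, num_classes=3)
print(conf)
print(round(mean_iou(conf), 4))  # 0.7222 = (1/2 + 2/3 + 1) / 3
```

In practice the confusion matrix is accumulated over the whole validation set before the per-class IoU is taken, rather than averaging per-image scores.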

Figure 2: The general pipeline of self-supervised learning. The trained weights from solving a pretext task are applied to solve some downstream tasks.


This review was generated by AI and reviewed by a human editor.