QUICK REVIEW

[Paper Review] Semantic Segmentation using Vision Transformers: A survey

Hans Thisanke, Chamli Deshan|arXiv (Cornell University)|May 5, 2023

Advanced Neural Network Applications | Citations: 15

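One-line summary

This survey examines semantic segmentation with Vision Transformer (ViT) architectures, compares models such as SETR, Swin Transformer, Segmenter, SegFormer, and PVT on benchmark datasets such as ADE20K and Cityscapes, and discusses data strategies and loss functions.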

ABSTRACT

Semantic segmentation has a broad range of applications in a variety of domains including land coverage analysis, autonomous driving, and medical image analysis. Convolutional neural networks (CNN) and Vision Transformers (ViTs) provide the architecture models for semantic segmentation. Even though ViTs have proven success in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection since ViT is not a general purpose backbone due to its patch partitioning scheme. In this survey, we discuss some of the different ViT architectures that can be used for semantic segmentation and how their evolution managed the above-stated challenge. The rise of ViT and its performance with a high success rate motivated the community to slowly replace the traditional convolutional neural networks in various computer vision tasks. This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets. This will be worthwhile for the community to yield knowledge regarding the implementations carried out in semantic segmentation and to discover more efficient methodologies using ViTs.

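Motivation and objectives

Assess how ViT-based architectures address the challenges of dense prediction in semantic segmentation.
Compare architecture types (pure ViT versus hybrid) and their decoder heads in terms of segmentation accuracy and efficiency.
Identify data-related strategies (transfer learning, self-supervised learning) that make ViTs usable with limited labeled data.
Summarize commonly used loss functions and benchmarks to guide future ViT segmentation research.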

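Proposed approach

Presents a taxonomy of ViT-based segmentation architectures (e.g., SETR, Swin Transformer, Segmenter, SegFormer, PVT).
Discusses architectural adaptations that reduce computational cost, such as hierarchical backbones, patch merging, and efficient self-attention.
Highlights benchmark results and dataset usage (ADE20K, Cityscapes, PASCAL-Context, and others).
Describes practical data strategies for ViTs in segmentation, including self-supervised learning and transfer learning.
Reviews loss functions (cross-entropy, weighted cross-entropy, focal loss, Dice/IoU losses) and their effect on segmentation accuracy.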
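
The loss functions named above can be sketched in a few lines. The following is a minimal illustration, not code from the survey: it assumes the binary (foreground/background) case, a flat list of per-pixel foreground probabilities, and plain Python instead of a tensor library. The function names and the `pos_weight`, `gamma`, and `smooth` parameters are illustrative choices.

```python
import math


def cross_entropy(probs, targets, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy; pos_weight > 1 up-weights foreground pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Focal loss: cross-entropy scaled by (1 - p_t)^gamma to down-weight easy pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        p_t = p if t == 1 else 1.0 - p  # probability assigned to the true class
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum of both masks + smooth)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)


probs = [0.9, 0.2, 0.7, 0.4]
targets = [1, 0, 1, 0]
print(cross_entropy(probs, targets), focal_loss(probs, targets), dice_loss(probs, targets))
```

With `gamma=0` the focal loss reduces to unweighted cross-entropy, which is a convenient sanity check. Dice loss is computed over the whole mask rather than per pixel, which is why it is often preferred when the foreground class is rare.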

Figure 1: Architecture of the Vision Transformer. The model splits an image into a number of fixed-size patches and linearly embeds them with position embeddings (left). Then the result is fed into a standard transformer encoder (right). Adapted from [2].
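
The patch-splitting and linear-embedding step described in the Figure 1 caption can be illustrated with a small sketch. It is an assumption-laden toy, not the ViT implementation: it uses a single-channel image stored as a list of rows, and the helper names `patchify` and `embed`, the `weight` layout (one list of coefficients per output dimension), and `pos_embed` are hypothetical.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches


def embed(patches, weight, pos_embed):
    """Project each flattened patch linearly, then add its position embedding."""
    tokens = []
    for patch, pos in zip(patches, pos_embed):
        projected = [sum(x * w for x, w in zip(patch, column)) for column in weight]
        tokens.append([v + p for v, p in zip(projected, pos)])
    return tokens


# A 4 x 4 image with pixel values 0..15, split into four 2 x 2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
tokens = embed(patches, weight=[[1, 1, 1, 1]], pos_embed=[[0], [0], [0], [0]])
print(len(patches), patches[0], tokens)
```

The number of tokens is (H / P) * (W / P), so the token sequence grows quadratically as the patch size shrinks. That growth is the cost pressure that hierarchical backbones and patch merging are meant to relieve.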

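Experimental results

Research questions

RQ1: Which ViT-based architectures have been proposed for semantic segmentation, and how do they perform across standard datasets?
RQ2: How do design choices (backbone type, decoder design, patch size) affect segmentation accuracy and efficiency?
RQ3: Which data strategies (supervised, self-supervised, transfer learning) best mitigate the data-hungry behavior of ViTs in segmentation tasks?
RQ4: Which loss functions are most effective for pixel-wise segmentation with ViTs across datasets?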

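Key findings

Swin Transformer achieves strong performance with a hierarchical design and linear-complexity attention, reporting 53.5% mIoU on ADE20K val in its own paper.
Segmenter pairs a ViT backbone with a mask transformer decoder and improves segmentation by exploiting global context more than CNN-based methods do.
SegFormer combines a hierarchical encoder, a lightweight MLP decoder, and a design that needs no positional encoding, delivering competitive results and robustness; it comes in variants B0 through B5.
SETR introduced a pure Transformer encoder for segmentation, and variants such as SETR-PUP and SETR-MLA have demonstrated their performance on ADE20K and Pascal Context.
PVT provides a progressive pyramid backbone that balances resolution against computational cost, improving efficiency on dense prediction tasks.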
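
The mIoU figures quoted above are means of per-class intersection-over-union scores. The sketch below shows the basic arithmetic on one flattened label map. It makes simplifying assumptions: the function name `mean_iou` is illustrative, classes absent from both prediction and ground truth are skipped, and there is no "ignore" label. Benchmark implementations typically accumulate intersections and unions over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes present in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that never occur in either map
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class from range(num_classes) occurs in pred or target")
    return sum(ious) / len(ious)


# Four pixels, two classes: class 0 IoU is 1/2, class 1 IoU is 2/3.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(mean_iou(pred, target, num_classes=2))
```

Because every class contributes equally to the mean, a model that segments large classes well but misses small ones can still score poorly on mIoU even when its pixel accuracy is high.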

Figure 2: The general pipeline of self-supervised learning. The trained weights from solving a pretext task are applied to solve some downstream tasks.


This review was generated by AI and checked by a human editor.