QUICK REVIEW

[Paper Review] Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation

Yuhui Yuan, Xiaokang Chen|arXiv (Cornell University)|Sep 24, 2019

Advanced Neural Network Applications | References: 78 | Citations: 473

One-Line Summary

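This paper introduces object-contextual representations (OCR) for semantic segmentation: pixel features are aggregated over learned object regions, and the resulting region representations are integrated through a Transformer-style encoder-decoder framework, improving segmentation accuracy across benchmarks.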

ABSTRACT

In this paper, we address the semantic segmentation problem with a focus on the context aggregation strategy. Our motivation is that the label of a pixel is the category of the object that the pixel belongs to. We present a simple yet effective approach, object-contextual representations, characterizing a pixel by exploiting the representation of the corresponding object class. First, we learn object regions under the supervision of ground-truth segmentation. Second, we compute the object region representation by aggregating the representations of the pixels lying in the object region. Last, we compute the relation between each pixel and each object region and augment the representation of each pixel with the object-contextual representation, which is a weighted aggregation of all the object region representations according to their relations with the pixel. We empirically demonstrate that the proposed approach achieves competitive performance on various challenging semantic segmentation benchmarks: Cityscapes, ADE20K, LIP, PASCAL-Context, and COCO-Stuff. Our submission "HRNet + OCR + SegFix" achieves 1st place on the Cityscapes leaderboard by the time of submission. Code is available at: https://git.io/openseg and https://git.io/HRNet.OCR. We rephrase the object-contextual representation scheme using the Transformer encoder-decoder framework. The details are presented in Section 3.3.

Motivation and Objectives

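Improve context aggregation for pixel labeling by treating each pixel's label as the category of the object the pixel belongs to.
Propose object-contextual representations, which learn soft object regions and a representation for each region.
Augment each pixel representation with a weighted combination of object region representations, where the weights come from the relation between the pixel and each object region.
Demonstrate strong performance on Cityscapes, ADE20K, LIP, PASCAL-Context, COCO-Stuff, and the COCO panoptic task.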

Proposed Method

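Form soft object regions M_1, ..., M_K, one per class, from a coarse soft segmentation learned on the backbone features.
Compute the object region representations f_k by aggregating the pixel features x_i, weighted by the normalized region membership \tilde{m}_{ki}.
Compute the pixel-object region relations w_{ik} as a softmax over the K regions of a relation function \kappa(x_i, f_k), taken as a dot product between transformed pixel and region features, and obtain the object-contextual representation y_i as the w_{ik}-weighted aggregation of the region representations.
Fuse the original pixel feature x_i with the object-contextual representation y_i through a small neural transformation to form the augmented pixel feature z_i.
Reformulate OCR as a Segmentation Transformer: K category queries act as object region selectors in the decoder cross-attention, which produces M_k and f_k, and the encoder cross-attention integrates the object region representations into the per-pixel prediction.
Backbone options include dilated ResNet-101 and HRNet-W48; the OCR module is trained end-to-end with a pixel-wise cross-entropy loss on both the object region supervision and the final segmentation.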
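
The steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ocr_forward` and the weight arguments `w_phi`, `w_psi`, `w_delta`, `w_rho`, and `w_g` are hypothetical, plain linear maps stand in for the small learned transformations, a spatial softmax is assumed as the normalization that yields \tilde{m}_{ki}, and the backbone and training loss are left out.

```python
import math


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both given as nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap the rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied independently to each row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def ocr_forward(x, region_logits, w_phi, w_psi, w_delta, w_rho, w_g):
    """Hypothetical OCR forward pass on flattened pixels.

    x:             N x C pixel features x_i
    region_logits: K x N coarse segmentation scores, one row per class
    w_phi, w_psi:  C x D linear stand-ins for the transforms inside kappa (assumption)
    w_delta:       C x D transform applied to region representations (assumption)
    w_rho:         D x C transform back to the pixel feature width (assumption)
    w_g:           2C x C_out fusion transform (assumption)
    """
    # Soft object regions: normalize each region's scores over all pixels.
    m_tilde = softmax_rows(region_logits)                     # K x N
    # Region representations f_k: membership-weighted sum of pixel features.
    f = matmul(m_tilde, x)                                    # K x C
    # Relation scores kappa(x_i, f_k): dot product of transformed features.
    kappa = matmul(matmul(x, w_phi), transpose(matmul(f, w_psi)))  # N x K
    # Pixel-region relations w_ik: softmax over the K regions for each pixel.
    w = softmax_rows(kappa)                                   # N x K
    # Object-contextual representation y_i: relation-weighted region features.
    y = matmul(matmul(w, matmul(f, w_delta)), w_rho)          # N x C
    # Augmented feature z_i: concatenate x_i with y_i, then apply the fusion map.
    xy = [xi + yi for xi, yi in zip(x, y)]                    # N x 2C
    z = matmul(xy, w_g)                                       # N x C_out
    return m_tilde, f, w, y, z
```

Read through the Transformer view, `m_tilde` and `f` play the role of the decoder cross-attention output (category queries attending over pixels), while `w` and `y` correspond to the encoder cross-attention (pixels attending over region representations). A real model would replace the nested lists with tensors and the linear maps with learned layers.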

Experimental Results

Research Questions

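RQ1: Can object-contextual representations improve semantic segmentation accuracy by explicitly modeling the relation between pixels and object regions?
RQ2: How do soft object regions and their region representations affect per-pixel classification accuracy?
RQ3: Does a Transformer-style cross-attention mechanism effectively implement the OCR concept for segmentation tasks?
RQ4: How does OCR's efficiency-accuracy trade-off compare with multi-scale and relational context methods?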

Key Findings

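OCR outperforms multi-scale context (PPM/ASPP) and relational context baselines across Cityscapes, ADE20K, LIP, PASCAL-Context, and COCO-Stuff.
Both the object region supervision and the pixel-region relation estimation contribute to the performance gains.
The approach achieves competitive or state-of-the-art results on multiple benchmarks and compares favorably with several relational and multi-scale context methods in memory, FLOPs, and running time.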


This review was generated by AI and checked by a human editor.