QUICK REVIEW

[Paper Review] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Liang-Chieh Chen, Yukun Zhu|arXiv (Cornell University)|Feb 7, 2018

Advanced Neural Network Applications | References: 65 | Citations: 4,542

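One-line summary

This paper extends DeepLabv3 into an encoder-decoder architecture, DeepLabv3+, by adding a lightweight decoder and using atrous separable convolutions. It reports state-of-the-art semantic segmentation results on PASCAL VOC 2012 and Cityscapes without any post-processing.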

ABSTRACT

Spatial pyramid pooling module or encoder-decoder structure are used in deep neural networks for semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on PASCAL VOC 2012 and Cityscapes datasets, achieving the test set performance of 89.0% and 82.1% without any post-processing. Our paper is accompanied with a publicly available reference implementation of the proposed models in Tensorflow at https://github.com/tensorflow/models/tree/master/research/deeplab.

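Motivation and objectives

Combine the strengths of spatial pyramid pooling and encoder-decoder structures for semantic segmentation.
Use atrous convolution to control the resolution of the encoder features, which allows accuracy to be traded off against speed.
Introduce a decoder that reuses encoder features and refines object boundaries.
Adopt depthwise separable convolutions (based on Xception) to improve both speed and accuracy.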
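
The resolution control mentioned above can be illustrated with a little arithmetic. The sketch below is not taken from the paper's code: the function names, the 513-pixel input size, the five stride-2 stages, and the 'same'-style padding are all assumptions made for illustration. It shows how the output stride determines the encoder feature map size, how a dilation rate enlarges the extent covered by a 3x3 kernel, and how striding can be swapped for dilation once the target output stride is reached.

```python
import math


def feature_map_size(input_size, output_stride):
    """Spatial size of the encoder output, assuming 'same'-style padding."""
    return math.ceil(input_size / output_stride)


def effective_kernel_size(kernel_size, rate):
    """Extent covered by a kernel whose taps are spaced `rate` pixels apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_schedule(stage_strides, output_stride):
    """Return (stride, rate) per stage so the total stride stops at output_stride.

    Once the target stride is reached, a stage keeps stride 1 and the
    stages after it are dilated by the stride that was removed.
    """
    current_stride = 1
    rate = 1
    schedule = []
    for stride in stage_strides:
        if current_stride >= output_stride:
            schedule.append((1, rate))  # striding removed, resolution kept
            rate *= stride              # later stages compensate by dilating
        else:
            schedule.append((stride, rate))
            current_stride *= stride
    return schedule


# Hypothetical backbone with five stride-2 stages (total stride 32).
stages = [2, 2, 2, 2, 2]
for target in (16, 8):
    print("output stride", target,
          "feature size", feature_map_size(513, target),
          "schedule", atrous_schedule(stages, target))
print("3x3 kernel extent at rate 6:", effective_kernel_size(3, 6))
```

A smaller output stride keeps a larger feature map, which tends to help accuracy but costs more computation; that is the trade-off the list item refers to.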

Proposed method

Extend DeepLabv3 with a simple yet effective decoder that refines object boundaries.
Apply atrous (dilated) convolution in the encoder to control feature density and receptive field.
Use depthwise separable convolutions in their atrous form (atrous separable convolution) in both the ASPP and decoder modules.
Adapt an Aligned Xception backbone, built on depthwise separable convolutions, to reduce computation.
Train end-to-end on VOC 2012, using an output stride of 16 or 8 to balance accuracy and speed.
A public TensorFlow implementation is provided in the DeepLab repository.
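
To see why depthwise separable convolutions reduce computation, it helps to count multiply-adds for a single layer. The sketch below uses the standard cost formulas for a regular convolution and for a depthwise convolution followed by a 1x1 pointwise convolution. The function names and the layer shape (33x33 feature map, 256 channels in and out) are hypothetical choices for illustration, and the result is a per-layer figure that should not be confused with the whole-network savings reported in the paper.

```python
def standard_conv_madds(height, width, kernel, c_in, c_out):
    """Multiply-adds of a regular kernel x kernel convolution (stride 1)."""
    return height * width * kernel * kernel * c_in * c_out


def separable_conv_madds(height, width, kernel, c_in, c_out):
    """Multiply-adds of a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = height * width * kernel * kernel * c_in  # one filter per channel
    pointwise = height * width * c_in * c_out            # 1x1 channel mixing
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
h = w = 33
k, c_in, c_out = 3, 256, 256

standard = standard_conv_madds(h, w, k, c_in, c_out)
separable = separable_conv_madds(h, w, k, c_in, c_out)
print(f"standard:  {standard:,} multiply-adds")
print(f"separable: {separable:,} multiply-adds")
print(f"ratio:     {separable / standard:.3f}")
```

The dilation rate does not appear in either formula: spacing the kernel taps apart enlarges the field of view without adding multiply-adds, which is why the atrous and separable ideas combine cheaply.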

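Experimental results

Research questions

RQ1: Can an encoder-decoder structure with a simple decoder on top of ASPP improve boundary sharpness without post-processing?
RQ2: How do atrous separable convolution and an Xception-based backbone affect the accuracy and speed of semantic segmentation?
RQ3: How does the proposed decoder design affect boundary accuracy and overall mIoU on standard benchmarks?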

Key findings

DeepLabv3+ with the proposed decoder achieves 89.0% mIoU on the PASCAL VOC 2012 test set (with JFT pretraining).
On Cityscapes, DeepLabv3+ reaches 82.1% mIoU on the test set without post-processing; validation results range from 79.55% to 82.1% depending on backbone and configuration.
With Xception as the backbone together with atrous separable convolution, Multiply-Adds are reduced by 33–41% while achieving comparable mIoU.
The chosen decoder design improves over naive bilinear upsampling, with the most pronounced gains near object boundaries in the trimap analysis.
With COCO/JFT pretraining, the model achieves 89.0% on the VOC 2012 test set and 82.1% on Cityscapes after fine-tuning.
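
The figures above are reported as mIoU (mean intersection-over-union). As a reminder of what that metric measures, here is a minimal sketch that computes it from a confusion matrix. The helper names and the toy labels are made up for illustration, and the benchmark evaluation servers apply their own conventions (for example, ignoring void pixels) that this sketch does not model.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def mean_iou(confusion):
    """Average of per-class IoU = TP / (TP + FP + FN), skipping absent classes."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp  # pixels wrongly labelled c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: ten pixels, two classes (flattened label maps).
truth = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
cm = confusion_matrix(truth, pred, num_classes=2)
print("confusion matrix:", cm)
print("mIoU:", round(mean_iou(cm), 4))
```

Because every class contributes equally to the average, small or thin classes weigh as much as large ones, so errors along object boundaries can move the score noticeably.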

This review was generated by AI and checked by a human editor.