QUICK REVIEW

[Paper Review] Rethinking Atrous Convolution for Semantic Image Segmentation

Liang-Chieh Chen, George Papandreou|arXiv (Cornell University)|Jun 17, 2017
Image Retrieval and Classification Techniques | References: 2 | Citations: 7,431
One-sentence summary

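This paper revisits atrous (dilated) convolution for semantic segmentation and presents DeepLabv3. DeepLabv3 uses cascaded atrous blocks and an augmented Atrous Spatial Pyramid Pooling (ASPP) module that incorporates image-level features to capture multi-scale context, and it achieves results close to the state of the art on VOC 2012 without DenseCRF post-processing.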

ABSTRACT

In this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance. We also elaborate on implementation details and share our experience on training our system. The proposed 'DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.

Motivation and objectives

  • Address two challenges in semantic segmentation, reduced feature resolution and objects at multiple scales, using atrous convolution.
  • Develop architectures that capture multi-scale context via cascaded atrous blocks and parallel atrous branches (ASPP).
  • Augment ASPP with image-level global context features and study training details to improve performance.

Proposed method

  • Apply atrous convolution to extract dense features while controlling output resolution (output_stride).
  • Design cascaded atrous convolution blocks to progressively increase receptive field without excessive spatial decimation.
  • Revisit and augment Atrous Spatial Pyramid Pooling (ASPP) with multiple rates, batch normalization, and image-level features to provide global context.
  • Experiment with multi-grid rates within cascaded blocks to enhance long-range context capture.
  • Train with a refined protocol including upsampling logits during training, fine-tuning batch normalization, and larger crop sizes.
  • Evaluate under different output_stride settings and inference strategies (multi-scale, flips) to maximize accuracy.
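
The ideas in the list above can be illustrated with a small, dependency-free sketch. It is a 1-D toy, not the paper's implementation: the function names (`effective_kernel_size`, `atrous_conv1d`, `toy_aspp`) are hypothetical, the kernels are fixed rather than learned, and the "image-level" branch is reduced to a plain global average broadcast to every position (the 1x1 convolutions, batch normalization, and upsampling of a real ASPP module are omitted). What it shows is that raising the atrous rate widens the field of view while the output keeps the input's resolution, and that parallel branches with different rates yield one multi-scale feature vector per position.

```python
from typing import List, Sequence


def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Span covered by a filter when (rate - 1) gaps sit between adjacent taps."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal: Sequence[float], kernel: Sequence[float], rate: int) -> List[float]:
    """1-D atrous convolution (cross-correlation form) with zero padding.

    The output has the same length as the input, so spatial resolution is
    preserved regardless of the rate.
    """
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + (j - center) * rate  # taps are spaced `rate` samples apart
            if 0 <= idx < len(signal):     # positions outside the signal count as zero
                acc += weight * signal[idx]
        out.append(acc)
    return out


def toy_aspp(signal: Sequence[float], kernel: Sequence[float], rates: Sequence[int]) -> List[List[float]]:
    """Toy ASPP: parallel atrous branches plus a global-average branch.

    Returns one feature vector per position: one entry per rate, followed by
    the global average (a stand-in for the image-level feature).
    """
    branches = [atrous_conv1d(signal, kernel, r) for r in rates]
    global_mean = sum(signal) / len(signal)
    branches.append([global_mean] * len(signal))  # broadcast to every position
    return [list(column) for column in zip(*branches)]
```

With a 3-tap kernel, `effective_kernel_size(3, 2)` is 5: the filter still has three weights but spans five input samples. Calling `toy_aspp(signal, [1, 1, 1], (1, 2))` returns, for each position, the rate-1 response, the rate-2 response, and the global average.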

Experimental results

Research questions

  • RQ1: How can atrous convolution be restructured to better capture multi-scale context for semantic segmentation?
  • RQ2: Does augmenting ASPP with image-level features and careful BN training improve segmentation accuracy over prior DeepLab variants?
  • RQ3: What is the impact of cascaded vs parallel multi-rate atrous modules on segmentation performance?
  • RQ4: How do training/inference strategies (output_stride, crop size, bootstrapping) affect performance on VOC2012 and Cityscapes?
  • RQ5: What gains are achievable with MS-COCO pretraining for the proposed DeepLabv3 architectures?

Key findings

  • DeepLabv3 achieves 85.7% mIOU on PASCAL VOC 2012 test without DenseCRF post-processing.
  • With MS-COCO pretraining followed by fine-tuning, the variant whose backbone is also pretrained on JFT-300M reaches 86.9% mIOU on VOC2012 test.
  • Augmenting ASPP with image-level features and tuning batch normalization improves VOC2012 val performance; the best ASPP setup reaches 79.77% mIOU with output_stride=8, multi-scale inputs, and left-right flips at inference.
  • On Cityscapes, DeepLabv3 trained on train_fine alone reaches 79.30% mIOU on the validation set with multi-scale inputs and flips at inference; the 81.3% mIOU test-set result comes from a model that is additionally trained with the coarse annotations.
  • Inference strategies (output_stride=8, multi-scale inputs, and left-right flips) consistently boost performance over baseline OS=16.
  • Bootstrapping hard images (e.g., bicycle) during training improves performance on rare/finely annotated classes.
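
All of the figures above are mean intersection-over-union (mIOU) scores. As a reminder of what the metric measures, here is a minimal sketch that computes it from a confusion matrix. The function name `mean_iou` and the choice to skip classes that appear in neither the ground truth nor the predictions are assumptions made for illustration; the official benchmark evaluation code may treat such classes differently.

```python
from typing import List, Sequence


def mean_iou(confusion: Sequence[Sequence[int]]) -> float:
    """Mean IoU from a square confusion matrix.

    confusion[i][j] is the number of pixels whose ground-truth class is i
    and whose predicted class is j.
    """
    num_classes = len(confusion)
    ious: List[float] = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos                               # class c pixels predicted as something else
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos  # other pixels predicted as class c
        union = true_pos + false_pos + false_neg
        if union > 0:  # assumption: ignore classes absent from both labels and predictions
            ious.append(true_pos / union)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)
```

For the two-class matrix `[[3, 1], [1, 5]]`, the per-class IoUs are 3/5 and 5/7, so `mean_iou` returns their average. Because every false positive for one class is a false negative for another, mIOU penalizes errors more heavily than pixel accuracy does, which is why gains of one or two points are treated as meaningful.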


This review was generated by AI and checked by a human editor.