QUICK REVIEW

[Paper Review] DWRSeg: Rethinking Efficient Acquisition of Multi-scale Contextual Information for Real-time Semantic Segmentation

Haoran Wei, Xu Liu|arXiv (Cornell University)|Dec 2, 2022

Advanced Neural Network Applications|Cited by 33

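One-line summary

DWRSeg introduces Region Residualization and Semantic Residualization to capture multi-scale context efficiently for real-time semantic segmentation, achieving a state-of-the-art speed-accuracy trade-off on Cityscapes and CamVid without pretraining.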

ABSTRACT

Many current works directly adopt multi-rate depth-wise dilated convolutions to capture multi-scale contextual information simultaneously from one input feature map, thus improving the feature extraction efficiency for real-time semantic segmentation. However, this design may lead to difficult access to multi-scale contextual information because of the unreasonable structure and hyperparameters. To lower the difficulty of drawing multi-scale contextual information, we propose a highly efficient multi-scale feature extraction method, which decomposes the original single-step method into two steps, Region Residualization-Semantic Residualization. In this method, the multi-rate depth-wise dilated convolutions take a simpler role in feature extraction: performing simple semantic-based morphological filtering with one desired receptive field in the second step based on each concise feature map of region form provided by the first step, to improve their efficiency. Moreover, the dilation rates and the capacity of dilated convolutions for each network stage are elaborated to fully utilize all the feature maps of region form that can be achieved. Accordingly, we design a novel Dilation-wise Residual (DWR) module and a Simple Inverted Residual (SIR) module for the high and low level network, respectively, and form a powerful DWR Segmentation (DWRSeg) network. Extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of our method by achieving a state-of-the-art trade-off between accuracy and inference speed, in addition to being lighter weight. Without pretraining or resorting to any training trick, we achieve an mIoU of 72.7% on the Cityscapes test set at a speed of 319.5 FPS on one NVIDIA GeForce GTX 1080 Ti card, which exceeds the latest methods of a speed of 69.5 FPS and 0.8% mIoU. The code and trained models are publicly available.
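
The abstract's "one desired receptive field" per dilated convolution can be made concrete with the standard span formula: a k×k kernel with dilation d covers k + (k - 1)(d - 1) pixels per side. The sketch below is only that arithmetic. The dilation rates (1, 3, 5) are illustrative values chosen here, not figures taken from this review, and the function names are hypothetical.

```python
def dilated_kernel_span(kernel_size: int, dilation: int) -> int:
    """Side length in pixels covered by a single dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(layers) -> int:
    """Receptive field of a stack of stride-1 convolutions.

    `layers` is a sequence of (kernel_size, dilation) pairs, applied in order.
    Each stride-1 layer widens the field by (span - 1) pixels.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += dilated_kernel_span(kernel_size, dilation) - 1
    return field


# Illustrative dilation rates (an assumption, not from the review).
for rate in (1, 3, 5):
    print(f"3x3 kernel, dilation {rate}: span {dilated_kernel_span(3, rate)}")

# A plain 3x3 convolution followed by a 3x3 convolution with dilation 3.
print(stacked_receptive_field([(3, 1), (3, 3)]))
```

Under these assumptions, each branch of a multi-rate design sees a different window of the same input, which is the sense in which one branch corresponds to one receptive field.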

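Motivation and objectives

Capture multi-scale contextual information efficiently for real-time semantic segmentation.
Identify the limitations of learning multi-scale context with multi-rate depth-wise dilated convolutions.
Propose a two-step framework (Region Residualization, then Semantic Residualization) that simplifies multi-scale feature extraction.
Design the DWR and SIR modules with stage-specific receptive-field strategies.
Demonstrate a state-of-the-art speed-accuracy trade-off on Cityscapes and CamVid without pretraining.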

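Proposed method

Proposes a two-step feature extraction that turns complex multi-scale context gathering into simple morphological filtering on concise region-form feature maps.
Introduces Region Residualization to generate concise region-form feature maps.
Introduces Semantic Residualization, which applies a single desired dilation rate to each region-form map via a depth-wise dilated convolution.
Designs a Dilation-wise Residual (DWR) module for the high-level stages and a Simple Inverted Residual (SIR) module for the low-level stages.
Adopts a lightweight encoder-decoder architecture with an FCN-like decoder and a SegHead for the final prediction.
Tunes the dilation rates and branch capacity so that each network stage makes use of its region-form maps.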
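
The two steps listed above can be sketched in NumPy as a rough illustration. This is not the authors' implementation. The channel split across branches, the dilation rates (1, 3, 5), the omission of batch normalization, and every function and variable name are assumptions made here for the example.

```python
import numpy as np


def conv3x3(x, weight):
    """Standard 3x3 convolution, stride 1, zero padding.

    x: (C_in, H, W); weight: (C_out, C_in, 3, 3).
    """
    _, height, width = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], height, width))
    for ky in range(3):
        for kx in range(3):
            patch = padded[:, ky:ky + height, kx:kx + width]
            out += np.einsum("oc,chw->ohw", weight[:, :, ky, kx], patch)
    return out


def depthwise_dilated_conv3x3(x, weight, dilation):
    """Depth-wise 3x3 convolution with one dilation rate.

    x: (C, H, W); weight: (C, 3, 3). Each channel is filtered on its own.
    """
    _, height, width = x.shape
    d = dilation
    padded = np.pad(x, ((0, 0), (d, d), (d, d)))
    out = np.zeros(x.shape)
    for ky in range(3):
        for kx in range(3):
            patch = padded[:, ky * d:ky * d + height, kx * d:kx * d + width]
            out += weight[:, ky, kx][:, None, None] * patch
    return out


def dwr_block(x, w_region, branch_weights, dilations, w_fuse):
    """Hypothetical DWR-style block: two-step extraction plus a residual."""
    # Step 1 (Region Residualization): 3x3 conv + ReLU gives region-form maps.
    region = np.maximum(conv3x3(x, w_region), 0.0)
    # Assumption: region maps are split evenly, one group per branch.
    groups = np.array_split(region, len(dilations), axis=0)
    # Step 2 (Semantic Residualization): one dilation rate per group.
    branches = [
        depthwise_dilated_conv3x3(group, weight, rate)
        for group, weight, rate in zip(groups, branch_weights, dilations)
    ]
    merged = np.concatenate(branches, axis=0)
    # 1x1 point-wise convolution fuses the branches; then add the input back.
    fused = np.einsum("oc,chw->ohw", w_fuse, merged)
    return x + fused


rng = np.random.default_rng(0)
channels, height, width = 6, 16, 16
dilations = (1, 3, 5)  # illustrative rates, not taken from the review
x = rng.standard_normal((channels, height, width))
w_region = rng.standard_normal((channels, channels, 3, 3)) * 0.1
branch_weights = [
    rng.standard_normal((channels // len(dilations), 3, 3)) * 0.1
    for _ in dilations
]
w_fuse = rng.standard_normal((channels, channels)) * 0.1
y = dwr_block(x, w_region, branch_weights, dilations, w_fuse)
print(y.shape)  # (6, 16, 16)
```

The point of the sketch is the division of labor: the first convolution does the heavier feature extraction, and each dilated branch then only filters its own maps at a single rate.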

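Experimental results

Research questions

RQ1: How can multi-scale contextual information be acquired efficiently without the complexity of multi-rate dilated convolutions?
RQ2: Which receptive-field strategy for the high-level and low-level stages of the network maximizes accuracy while maintaining real-time speed?
RQ3: Does the two-step residual approach make dilated-convolution-based context modeling clearer and more efficient to learn?
RQ4: How does the speed-accuracy trade-off on Cityscapes and CamVid compare with existing real-time segmentation methods?
RQ5: Is there evidence that stage-specific receptive-field design yields lightweight models with competitive mIoU?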

Key findings

Model	Input	Ratio	mIoU (%)	FPS	Params (M)
ENet	0.5	-	58.3	76.9	0.37
ICNet†	1.0	-	69.5	30.3	26.5
DABNet	1.0	-	70.1	27.7	0.76
DFANet B†	1.0	-	67.1	120	4.8
DFANet A†	1.0	-	71.3	100	7.8
BiSeNetV2	0.5	-	72.6	156	2.33
DF1-Seg	1.0	-	73.0	80	8.55
DF2-Seg	1.0	-	74.8	55	8.55
SFNet(DF1)	1.0	-	74.5	121	9.03
STDC1-Seg50†	0.5	-	71.9	250.4	9.97
STDC2-Seg50†	0.5	-	73.4	188.6	14.0
STDC1-Seg75†	0.75	-	75.3	126.7	9.97
STDC2-Seg75†	0.75	-	76.8	97.0	14.0
DWRSeg-B50	0.5	0.5	72.7	319.5	2.54
DWRSeg-L50	0.5	0.5	73.1	256.2	3.53
DWRSeg-B75	0.75	0.75	75.6	151.7	2.54
DWRSeg-L75	0.75	0.75	76.3	123.4	3.53
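
For readers unfamiliar with the mIoU (%) column, the sketch below computes mean intersection-over-union from a confusion matrix using the standard definition, IoU = TP / (TP + FP + FN), averaged over classes. The toy matrix is invented for illustration. The benchmarks' exact evaluation protocol (ignored labels, dataset-level accumulation) is not described in this review and is not modeled here.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] is the number of pixels whose ground-truth class is i
    and whose predicted class is j.
    """
    num_classes = len(confusion)
    ious = []
    for cls in range(num_classes):
        true_pos = confusion[cls][cls]
        false_neg = sum(confusion[cls]) - true_pos
        false_pos = sum(confusion[row][cls] for row in range(num_classes)) - true_pos
        denom = true_pos + false_pos + false_neg
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(true_pos / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example (made-up pixel counts).
toy_confusion = [
    [50, 10],
    [5, 35],
]
print(round(mean_iou(toy_confusion), 4))  # 0.7346
```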

Reaches 72.7% mIoU on the Cityscapes test set at 319.5 FPS on a GTX 1080 Ti, without pretraining.
DWRSeg-L75 reaches 76.3% mIoU on the Cityscapes test set at 123.4 FPS.
DWRSeg-B50 reaches 72.7% mIoU at 319.5 FPS with 2.54M parameters.
On CamVid, DWRSeg-B reaches 76.5% mIoU at 237.2 FPS and DWRSeg-L reaches 77.5% at 189.2 FPS.
The proposed DWR and SIR modules, together with region-form feature maps, give efficient, targeted use of receptive fields in a lightweight network.
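
One way to read the speed-accuracy trade-off in the table is to list the models that no other model beats on both mIoU and FPS at once (the Pareto front). The sketch below does this with the table's numbers copied as-is. It assumes the FPS figures are comparable across rows, which this review does not confirm, and the helper name is hypothetical.

```python
# (model, mIoU %, FPS) copied from the results table above.
ROWS = [
    ("ENet", 58.3, 76.9),
    ("ICNet", 69.5, 30.3),
    ("DABNet", 70.1, 27.7),
    ("DFANet B", 67.1, 120.0),
    ("DFANet A", 71.3, 100.0),
    ("BiSeNetV2", 72.6, 156.0),
    ("DF1-Seg", 73.0, 80.0),
    ("DF2-Seg", 74.8, 55.0),
    ("SFNet(DF1)", 74.5, 121.0),
    ("STDC1-Seg50", 71.9, 250.4),
    ("STDC2-Seg50", 73.4, 188.6),
    ("STDC1-Seg75", 75.3, 126.7),
    ("STDC2-Seg75", 76.8, 97.0),
    ("DWRSeg-B50", 72.7, 319.5),
    ("DWRSeg-L50", 73.1, 256.2),
    ("DWRSeg-B75", 75.6, 151.7),
    ("DWRSeg-L75", 76.3, 123.4),
]


def pareto_front(rows):
    """Names of models not dominated on (mIoU, FPS), in table order.

    A model is dominated if another one is at least as good on both
    metrics and strictly better on at least one.
    """
    front = []
    for name, miou, fps in rows:
        dominated = any(
            other_miou >= miou and other_fps >= fps
            and (other_miou > miou or other_fps > fps)
            for _, other_miou, other_fps in rows
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(ROWS))
```

By the table's numbers, all four DWRSeg variants are non-dominated, together with STDC2-Seg50 and STDC2-Seg75.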

This review was generated by AI and checked by a human editor.