QUICK REVIEW

[Paper Review] SPGNet: Semantic Prediction Guidance for Scene Parsing

Bowen Cheng, Liang-Chieh Chen|arXiv (Cornell University)|2019. 08. 26.

Human Pose and Action Recognition|References: 78|Citations: 35

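One-line summary

SPGNet re-weights local features under pixel-wise semantic supervision through a Semantic Prediction Guidance (SPG) module inside a two-stage encoder-decoder network, and achieves strong results on Cityscapes at a comparatively low computational cost.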

ABSTRACT

Multi-scale context module and single-stage encoder-decoder structure are commonly employed for semantic segmentation. The multi-scale context module refers to the operations to aggregate feature responses from a large spatial extent, while the single-stage encoder-decoder structure encodes the high-level semantic information in the encoder path and recovers the boundary information in the decoder path. In contrast, multi-stage encoder-decoder networks have been widely used in human pose estimation and show superior performance than their single-stage counterpart. However, few efforts have been attempted to bring this effective design to semantic segmentation. In this work, we propose a Semantic Prediction Guidance (SPG) module which learns to re-weight the local features through the guidance from pixel-wise semantic prediction. We find that by carefully re-weighting features across stages, a two-stage encoder-decoder network coupled with our proposed SPG module can significantly outperform its one-stage counterpart with similar parameters and computations. Finally, we report experimental results on the semantic segmentation benchmark Cityscapes, in which our SPGNet attains 81.1% on the test set using only 'fine' annotations.

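Motivation and objectives

Motivate the use of efficient multi-stage architectures for semantic segmentation.
Propose the SPG module, which re-weights features under the guidance of pixel-level semantic predictions.
Explore multi-stage encoder-decoder networks to improve boundary recovery and context fusion.
Evaluate on Cityscapes to demonstrate gains in accuracy and efficiency.
Provide ablation studies and visualizations that explain the SPG mechanism.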

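Proposed method

Introduce the SPG module, which uses a supervise-and-excite framework to generate Guided Attention from the first-stage prediction.
Use a two-stage encoder-decoder with Cross Stage Feature Aggregation to strengthen the later stage.
Design a lightweight upsample module with residual blocks for efficient feature fusion.
Produce a per-pixel channel mask with 1x1 convolutions, compute Guided Attention from it, and re-weight the decoder features accordingly.
Supervise training with losses on both the final-stage and intermediate-stage logits.
Compare against state-of-the-art methods on Cityscapes and run extensive ablations and visualizations.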
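
The supervise-and-excite step described above can be sketched at a single pixel, where a 1x1 convolution reduces to a linear map over channels. This is only an illustrative sketch, not the authors' implementation: the helper names (conv1x1, spg_pixel), the weight arguments, and the additive form of the identity path are all assumptions made for this example.

```python
import math

def conv1x1(pixel, weights, bias):
    """A 1x1 convolution at one pixel is a linear map over its channels."""
    return [sum(w * x for w, x in zip(row, pixel)) + b
            for row, b in zip(weights, bias)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def spg_pixel(feature, w_sup, b_sup, w_exc, b_exc):
    """Supervise-and-excite at a single pixel (hypothetical helper).

    feature: the C decoder channels at this pixel.
    w_sup, b_sup: C -> K projection producing per-class logits; in training
        these logits would receive a segmentation loss (the "supervise" part).
    w_exc, b_exc: K -> C projection producing the per-channel mask
        (the "excite" part).
    """
    logits = conv1x1(feature, w_sup, b_sup)
    mask = [sigmoid(m) for m in conv1x1(logits, w_exc, b_exc)]
    # Assumption: the identity path is additive, so the original feature is
    # kept and the attention-weighted feature is added on top of it.
    out = [f + f * a for f, a in zip(feature, mask)]
    return logits, mask, out

# Toy example: C = 3 channels, K = 2 classes, hand-picked weights.
logits, mask, out = spg_pixel(
    feature=[1.0, -2.0, 0.5],
    w_sup=[[1, 0, 0], [0, 1, 0]], b_sup=[0, 0],
    w_exc=[[0, 0], [0, 0], [0, 0]], b_exc=[0, 0, 0],
)
print(logits, mask, out)
```

With all excitation weights set to zero the mask is 0.5 for every channel, so each feature is scaled by 1.5; trained weights would instead yield a mask that depends on the predicted class at that pixel.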

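Experimental results

Research questions

RQ1: Does the SPG module improve feature re-weighting and segmentation accuracy when guided by pixel-wise semantic predictions?
RQ2: Can a two-stage encoder-decoder with SPG outperform a single-stage alternative with similar parameters and computation?
RQ3: How does SPGNet compare with DenseASPP and DANet on Cityscapes in accuracy and efficiency?
RQ4: How much does each SPG component (supervision, identity mapping, excitation mechanism) contribute to overall performance?
RQ5: Does a multi-stage encoder-decoder network benefit semantic segmentation when combined with SPG?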

Key results

Method	Backbone	mIoU (%)	Parameters	FLOPs (B)
DenseASPP	DenseNet-161	80.6	35.4 M	1240.1
DANet	ResNet-101	81.5	66.5 M	2878.9
SPGNet (Ours)	2× ResNet-50	81.1	59.8 M	654.8

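SPGNet achieves 81.1% mean IoU on the Cityscapes test set using only 'fine' annotations.
SPGNet outperforms DenseASPP on most classes on the Cityscapes test set and uses about 22.7% of DANet's computation (654.8 B vs. 2878.9 B FLOPs).
The two-stage SPGNet with ResNet-50 backbones achieves strong accuracy with markedly lower FLOPs and parameter counts than many top-performing methods.
Ablations show that combining supervision, sigmoid-based SPG excitation, and the identity path yields the best mIoU (77.67% on the validation set with ResNet-18).
Guided Attention maps provide interpretable re-weighting, visualizing object localization and discrimination between similar classes.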
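
The efficiency comparison can be checked directly from the table. The short sketch below recomputes the FLOPs ratios; the numbers are copied from the table rows, while the dictionary layout and the flops_ratio helper are assumptions introduced for illustration.

```python
# Rows copied from the results table: (mIoU %, parameters in M, FLOPs in B).
results = {
    "DenseASPP": (80.6, 35.4, 1240.1),
    "DANet": (81.5, 66.5, 2878.9),
    "SPGNet": (81.1, 59.8, 654.8),
}

def flops_ratio(method, baseline):
    """FLOPs of `method` as a percentage of the FLOPs of `baseline`."""
    return 100.0 * results[method][2] / results[baseline][2]

for baseline in ("DANet", "DenseASPP"):
    ratio = flops_ratio("SPGNet", baseline)
    print(f"SPGNet uses {ratio:.1f}% of the FLOPs of {baseline}")
```

The DANet ratio reproduces the roughly 22.7% figure quoted above. Note that the same table lists fewer parameters for DenseASPP (35.4 M) than for SPGNet (59.8 M), so the saving relative to that method is in FLOPs rather than in parameter count.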

This review was generated by AI and reviewed by a human editor.