QUICK REVIEW

[Paper Review] Deep Contrast Learning for Salient Object Detection

Guanbin Li, Yizhou Yu | arXiv (Cornell University) | Mar 7, 2016

Visual Attention and Saliency Detection | References: 42 | Citations: 132

One-line summary

A two-stream, end-to-end deep network (a pixel-level MS-FCN stream and a segment-level spatial pooling stream) learns visual contrast for salient object detection; optional fully connected CRF post-processing improves spatial coherence.

ABSTRACT

Salient object detection has recently witnessed substantial progress due to powerful features extracted using deep convolutional neural networks (CNNs). However, existing CNN-based methods operate at the patch level instead of the pixel level. Resulting saliency maps are typically blurry, especially near the boundary of salient objects. Furthermore, image patches are treated as independent samples even when they are overlapping, giving rise to significant redundancy in computation and storage. In this CVPR 2016 paper, we propose an end-to-end deep contrast network to overcome the aforementioned limitations. Our deep network consists of two complementary components, a pixel-level fully convolutional stream and a segment-wise spatial pooling stream. The first stream directly produces a saliency map with pixel-level accuracy from an input image. The second stream extracts segment-wise features very efficiently, and better models saliency discontinuities along object boundaries. Finally, a fully connected CRF model can be optionally incorporated to improve spatial coherence and contour localization in the fused result from these two streams. Experimental results demonstrate that our deep model significantly improves the state of the art.

Motivation and objectives

Move salient object detection beyond patch-based CNNs by modeling visual contrast at both the pixel level and the segment level.
Propose an end-to-end architecture that produces high-resolution saliency maps efficiently.
Allow boundary-aware refinement via a fully connected CRF on the fused outputs.

Proposed method

Introduce a two-stream architecture: a pixel-level multi-scale fully convolutional network (MS-FCN) that produces a dense saliency map, and a segment-level spatial pooling stream that computes saliency over superpixels efficiently.
Fuse the two saliency maps via a 1x1 convolution layer whose weights are learned.
Optionally refine the fused map with a fully connected CRF to improve spatial coherence and contour localization.
Train streams in alternation: initialize the segment stream, then jointly fine-tune both streams and the fusion layer with a cross-entropy loss against ground-truth saliency maps.
Use an 8-pixel stride MS-FCN with hole (à trous) convolutions to maintain resolution and multi-scale context.
Define a loss weighting beta_i to balance salient and non-salient pixel contributions in training.
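
The fusion and loss-weighting steps listed above can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation. It assumes that both streams emit one raw score per pixel, that the 1x1 fusion layer therefore reduces to a per-pixel weighted sum plus a bias followed by a sigmoid, and that beta_i is the fraction of non-salient pixels in the image (a common class-balancing choice; the paper's exact definition may differ). The names `fuse`, `balanced_cross_entropy`, `w_pixel`, `w_segment`, and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(pixel_map, segment_map, w_pixel, w_segment, bias):
    """Fuse two single-channel score maps.

    A 1x1 convolution over two input channels is a per-pixel weighted
    sum plus a bias; the sigmoid turns the result into a saliency
    probability. In training the three parameters would be learned.
    """
    return [
        [sigmoid(w_pixel * p + w_segment * s + bias) for p, s in zip(p_row, s_row)]
        for p_row, s_row in zip(pixel_map, segment_map)
    ]


def balanced_cross_entropy(pred, target, eps=1e-7):
    """Class-balanced cross-entropy between a probability map and a binary mask.

    Assumption: beta is the fraction of non-salient pixels, so the
    (usually rarer) salient pixels receive the larger weight.
    """
    flat_p = [p for row in pred for p in row]
    flat_g = [g for row in target for g in row]
    n = len(flat_g)
    beta = (n - sum(flat_g)) / n
    loss = 0.0
    for p, g in zip(flat_p, flat_g):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        loss -= beta * g * math.log(p) + (1.0 - beta) * (1 - g) * math.log(1.0 - p)
    return loss / n


# Toy 2x2 example with made-up scores.
pixel_scores = [[2.0, -1.0], [0.5, -3.0]]
segment_scores = [[1.5, -0.5], [1.0, -2.0]]
ground_truth = [[1, 0], [1, 0]]

fused = fuse(pixel_scores, segment_scores, w_pixel=0.6, w_segment=0.4, bias=0.0)
loss = balanced_cross_entropy(fused, ground_truth)
print(fused, loss)
```

In a real network the fusion weights and both streams would be updated by backpropagating this loss; the sketch only evaluates the forward pass.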

Experimental results

Research questions

RQ1: Can an end-to-end architecture combining pixel-level and segment-level cues outperform patch-based CNNs in salient object detection?
RQ2: Does incorporating a CRF post-processing step yield measurable gains in spatial coherence and boundary accuracy?
RQ3: How do multi-scale contextual features and segment-level masking contribute to saliency accuracy across diverse datasets?
RQ4: Is fused MS-FCN and segment-level saliency more robust across images with multiple or boundary-touching salient objects?

Key findings

Dataset	Metric	SF	GC	DRFI	PISA	BSCA	LEGS	MC	MDF	FCN	DCL	DCL+
MSRA-B	maxF	0.700	0.719	0.845	0.837	0.830	0.870	0.894	0.885	0.864	0.905	0.916
MSRA-B	MAE	0.166	0.159	0.112	0.102	0.130	0.081	0.054	0.066	0.096	0.052	0.047
HKU-IS	MAE	0.173	0.211	0.167	0.127	0.174	0.118	0.102	0.076	0.087	0.054	0.049
DUT-OMRON	MAE	0.147	0.218	0.150	0.141	0.191	0.133	0.088	0.092	0.131	0.084	0.080
PASCAL-S	MAE	0.240	0.266	0.210	0.196	0.224	0.157	0.145	0.145	0.128	0.113	0.108
SOD	MAE	0.267	0.284	0.223	0.223	0.251	0.195	0.179	0.155	0.158	0.129	0.126

The DCL (two-stream) model outperforms prior methods on multiple datasets in maxF, MAE, and precision-recall analyses.
Adding CRF refinement (DCL+) yields further gains in accuracy and contour preservation across datasets.
The MS-FCN stream contributes substantially to performance, with the full two-stream fusion providing the best results.
The proposed method achieves state-of-the-art results against eight recent methods and an FCN baseline on the MSRA-B, HKU-IS, DUT-OMRON, PASCAL-S, and SOD datasets.
Training is feasible (approximately 25 hours on MSRA-B) and testing is efficient (approximately 1.5 seconds per image for DCL; 0.8 seconds for CRF refinement).
Ablation studies show both deep contrast learning and CRF contribute to improvements, with the two streams complementing each other.
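
For readers unfamiliar with the two metrics reported above, the following pure-Python sketch shows one way MAE and maximum F-measure can be computed for a single saliency map. It assumes predictions in [0, 1], a binary ground-truth mask, and the weighting beta^2 = 0.3 that is conventional in the saliency literature. Benchmarks usually aggregate precision and recall over a whole dataset before taking the maximum, so the per-image version here is a simplification. The function names and the threshold count are illustrative choices.

```python
def mean_absolute_error(pred, gt):
    """Average absolute difference between a saliency map and its mask (lower is better)."""
    flat_p = [p for row in pred for p in row]
    flat_g = [g for row in gt for g in row]
    return sum(abs(p - g) for p, g in zip(flat_p, flat_g)) / len(flat_g)


def max_f_measure(pred, gt, beta_sq=0.3, num_thresholds=256):
    """Best F-measure over evenly spaced binarization thresholds (higher is better)."""
    flat_p = [p for row in pred for p in row]
    flat_g = [g for row in gt for g in row]
    actual = sum(flat_g)  # number of truly salient pixels
    best = 0.0
    for k in range(num_thresholds):
        t = k / (num_thresholds - 1)
        predicted = sum(1 for p in flat_p if p >= t)
        tp = sum(1 for p, g in zip(flat_p, flat_g) if p >= t and g == 1)
        if tp == 0:
            continue  # precision and recall are zero or undefined
        precision = tp / predicted
        recall = tp / actual
        f = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
        best = max(best, f)
    return best


# Toy 2x2 example with made-up values.
prediction = [[0.9, 0.1], [0.8, 0.2]]
mask = [[1, 0], [1, 0]]
print(mean_absolute_error(prediction, mask))
print(max_f_measure(prediction, mask))
```

Because beta^2 is below 1, this F-measure weights precision more heavily than recall, which is why a map that highlights a little too little is penalized less than one that highlights too much.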


This review was generated by AI and checked by a human editor.