QUICK REVIEW

[Paper Review] RGBT Salient Object Detection: A Large-scale Dataset and Benchmark

Zhengzheng Tu, Yan Ma|arXiv (Cornell University)|Jul 7, 2020

Visual Attention and Saliency Detection36 citations

TL;DR

Introduces VT5000, a large-scale aligned RGBT dataset of 5000 image pairs for salient object detection, and presents ADFNet, an attention-based multi-modal fusion network that outperforms prior methods on VT5000 and two public datasets VT821/VT1000.

ABSTRACT

Salient object detection in complex scenes and environments is a challenging research topic. Most works focus on RGB-based salient object detection, which limits its performance of real-life applications when confronted with adverse conditions such as dark environments and complex backgrounds. Taking advantage of RGB and thermal infrared images becomes a new research direction for detecting salient object in complex scenes recently, as thermal infrared spectrum imaging provides the complementary information and has been applied to many computer vision tasks. However, current research for RGBT salient object detection is limited by the lack of a large-scale dataset and comprehensive benchmark. This work contributes such a RGBT image dataset named VT5000, including 5000 spatially aligned RGBT image pairs with ground truth annotations. VT5000 has 11 challenges collected in different scenes and environments for exploring the robustness of algorithms. With this dataset, we propose a powerful baseline approach, which extracts multi-level features within each modality and aggregates these features of all modalities with the attention mechanism, for accurate RGBT salient object detection. Extensive experiments show that the proposed baseline approach outperforms the state-of-the-art methods on VT5000 dataset and other two public datasets. In addition, we carry out a comprehensive analysis of different algorithms of RGBT salient object detection on VT5000 dataset, and then make several valuable conclusions and provide some potential research directions for RGBT salient object detection.

Motivation & Objective

Create a large, diverse, freely available RGBT dataset (VT5000) with ground truth masks and 11 annotations for challenges.
Propose an end-to-end CNN baseline (ADFNet) using RGB and thermal branches with attention-based fusion.
Improve boundary precision via an edge-aware loss and refine saliency with multi-layer feature fusion and global context modules.
Analyze and compare RGBT SOD methods on VT5000 and public datasets to guide future research.

Proposed method

Develop a two-stream VGG16-based backbone to separately extract RGB and thermal features.
Apply Convolutional Block Attention Module (CBAM) for channel and spatial feature weighting before fusion.
Fuse modality features at multiple layers to preserve both low-level and high-level information.
Incorporate Pyramid Pooling Module (PPM) to provide global contextual guidance at multiple scales.
Use a Feature Aggregation Module (FAM) to integrate multi-scale features after fusion.
Train with a cross-entropy loss and an edge-based refinement loss to sharpen object boundaries.

Figure 1: Sample image pairs with annotated ground truth and challenges from our RGBT dataset. (a) and (b) indicate the RGB image and its corresponding thermal image respectively. (c) is the corresponding ground truth of RGBT image pair.

Experimental results

Research questions

RQ1How large and diverse should an RGBT SOD dataset be to train robust deep networks?
RQ2Can attention-based multi-modal fusion improve RGBT salient object detection over single-modal or simple fusion baselines?
RQ3Do multi-layer fusion and global context modules enhance localization and boundary delineation in RGBT SOD?
RQ4What is the comparative performance of the proposed method on VT5000 and existing VT821/VT1000 datasets?

Key findings

VT5000 provides 5000 aligned RGBT image pairs with 11 annotated challenges, enabling robust evaluation of RGBT SOD methods.
The proposed ADFNet consistently outperforms state-of-the-art methods on VT5000 and two public datasets (VT821 and VT1000).
CBAM-based attention and multi-layer fusion effectively harness complementary RGB and thermal cues for saliency detection.
PPM and FAM modules improve global context awareness and multi-scale feature integration, respectively.
An edge loss helps produce sharper object boundaries in saliency maps.
A comprehensive VT5000 analysis yields actionable insights and future research directions for RGBT SOD.

Figure 2: Challenge distribution of VT821, VT1000 and VT5000.

Better researchstarts right now

From paper design to paper writing, dramatically reduce your research time.

No credit card · Free plan available

This review was created by AI and reviewed by human editors.