[Paper Review] RFBNet: Deep Multimodal Networks with Residual Fusion Blocks for RGB-D Semantic Segmentation
RFBNet introduces a bottom-up interactive fusion with residual fusion blocks to fuse RGB and depth streams for RGB-D semantic segmentation, achieving state-of-the-art results on ScanNet and Cityscapes benchmarks.
RGB-D semantic segmentation methods conventionally use two independent encoders to extract features from the RGB and depth data. However, an effective fusion mechanism that bridges the encoders and fully exploits the complementary information from the two modalities is lacking. This paper proposes a novel bottom-up interactive fusion structure to model the interdependencies between the encoders. The structure introduces an interaction stream to interconnect the encoders. The interaction stream not only progressively aggregates modality-specific features from the encoders but also computes complementary features for them. To instantiate this structure, the paper proposes a residual fusion block (RFB) to formulate the interdependencies of the encoders. The RFB consists of two residual units and one fusion unit with a gate mechanism. It learns complementary features for the modality-specific encoders and extracts modality-specific features as well as cross-modal features. Based on the RFB, the paper presents deep multimodal networks for RGB-D semantic segmentation called RFBNet. Experiments on two datasets demonstrate the effectiveness of modeling the interdependencies and show that RFBNet achieves state-of-the-art performance.
Motivation & Objective
- Make RGB-D semantic segmentation more robust by effectively exploiting the interdependencies between the RGB and depth encoders.
- Propose a bottom-up interactive fusion structure with a residual fusion block to enable cross-modal feature learning.
- Reduce computational load by shrinking the depth stream while maintaining performance.
- Demonstrate state-of-the-art performance on indoor (ScanNet) and outdoor (Cityscapes) datasets.
Proposed method
- Introduce three-stream architecture: RGB stream, depth stream, and interaction stream.
- Propose Residual Fusion Block (RFB) consisting of two modality-specific residual units and a gated fusion unit to learn complementary cross-modal features.
- Use a bottom-up interaction mechanism that progressively fuses the modalities from lower to higher layers, with the gated fusion unit (GFU) controlling cross-modal information via a four-gate mechanism.
- Shrink the depth stream to save computation while aligning depth features with RGB features for fusion.
- Integrate RFBs into a base framework (SSMA) for RGB-D fusion and evaluate on ScanNet and Cityscapes.
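The block structure listed above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: features are flat lists of floats rather than convolutional feature maps, each residual unit is a single scaled ReLU term with an identity shortcut, and the four gates are modeled as two input gates plus two output gates with element-wise sigmoids. The gate formulas, the `weights` dictionary, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def residual_unit(x: List[float], weight: float) -> List[float]:
    # Identity shortcut plus a ReLU-activated linear term
    # (a stand-in for the convolutional layers of a real residual unit).
    return [xi + max(0.0, weight * xi) for xi in x]


def gate(x: List[float], weight: float) -> List[float]:
    # Element-wise sigmoid gate; every value lies in (0, 1).
    return [sigmoid(weight * xi) for xi in x]


def gated_fusion_unit(
    rgb: List[float], depth: List[float], inter: List[float], w: Dict[str, float]
) -> Tuple[List[float], List[float], List[float]]:
    # Assumed input gates: decide how much of each modality enters the fusion.
    in_rgb = gate(rgb, w["in_rgb"])
    in_depth = gate(depth, w["in_depth"])
    # Aggregate gated modality features onto the interaction stream.
    fused = [
        m + gr * r + gd * d
        for m, gr, r, gd, d in zip(inter, in_rgb, rgb, in_depth, depth)
    ]
    # Assumed output gates: regulate the complementary features sent back.
    comp_rgb = [g * f for g, f in zip(gate(fused, w["out_rgb"]), fused)]
    comp_depth = [g * f for g, f in zip(gate(fused, w["out_depth"]), fused)]
    return fused, comp_rgb, comp_depth


def residual_fusion_block(
    rgb: List[float], depth: List[float], inter: List[float], w: Dict[str, float]
) -> Tuple[List[float], List[float], List[float]]:
    # Two modality-specific residual units.
    rgb_feat = residual_unit(rgb, w["ru_rgb"])
    depth_feat = residual_unit(depth, w["ru_depth"])
    # One gated fusion unit produces cross-modal and complementary features.
    fused, comp_rgb, comp_depth = gated_fusion_unit(rgb_feat, depth_feat, inter, w)
    # Complementary features are added back to each encoder stream.
    rgb_out = [r + c for r, c in zip(rgb_feat, comp_rgb)]
    depth_out = [d + c for d, c in zip(depth_feat, comp_depth)]
    return rgb_out, depth_out, fused


# Hypothetical weights and inputs, chosen only for illustration.
weights = {
    "ru_rgb": 0.5, "ru_depth": 0.5,
    "in_rgb": 1.0, "in_depth": 1.0,
    "out_rgb": 1.0, "out_depth": 1.0,
}
rgb, depth, inter = [0.5, -1.0, 2.0], [1.0, 0.0, -0.5], [0.0, 0.0, 0.0]

# Bottom-up interaction: the interaction stream is carried from block to block.
for _ in range(3):
    rgb, depth, inter = residual_fusion_block(rgb, depth, inter, weights)
```

The loop at the end mirrors the three-stream idea: each block updates the RGB stream, the depth stream, and the interaction stream, and the interaction stream's output becomes the next block's input. A real network would replace the list arithmetic with convolutions and learn the gate parameters.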
Experimental results
Research questions
- RQ1: Does an explicit bottom-up interactive fusion improve RGB-D semantic segmentation over traditional early, late, or multi-level fusion schemes?
- RQ2: Can residual fusion blocks effectively model the interdependencies between RGB and depth encoders to improve segmentation accuracy?
- RQ3: What is the impact of reducing depth stream resolution on overall performance and efficiency?
- RQ4: How does RFBNet perform across indoor and outdoor RGB-D datasets compared to state-of-the-art methods?
Key findings
- RFBNet consistently outperforms baselines such as SSMA and FuseNet on ScanNet with 59.2% mIoU.
- On Cityscapes, RFBNet with ERFNetEnc reaches 69.7% mIoU on test, and with AdapNet++ reaches 74.8% mIoU on test (multimodal).
- Ablation shows gates add 0.4% efficiency gain, and adding complementary features via the RFB (the R option) yields an additional 0.9% gain, surpassing trunk-only additions.
- Shrinking the depth input reduces depth-based computation by roughly 75% with modest or positive effects when combined with the interactive fusion.
- The RFB structure enables the encoders to exchange information and produce cross-modal features while preserving modality-specific strengths.
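The roughly 75% figure for depth-stream computation is consistent with halving the depth input in both height and width, since convolution cost scales with the number of spatial positions. The arithmetic below is a sketch under that assumption; the halving factor, the layer sizes, and the `conv_macs` helper are illustrative and not taken from the review.

```python
def conv_macs(height: int, width: int, in_ch: int, out_ch: int, kernel: int = 3) -> int:
    """Multiply-accumulate count for one stride-1, same-padded convolution."""
    return height * width * in_ch * out_ch * kernel * kernel


# Hypothetical layer: 64 -> 64 channels with a 3x3 kernel.
full = conv_macs(768, 384, 64, 64)   # depth input at full resolution
half = conv_macs(384, 192, 64, 64)   # height and width both halved (assumed)

reduction = 1 - half / full
print(f"{reduction:.0%}")  # prints 75%
```

Because every layer's cost shrinks by the same factor of four, the saving holds for the whole depth stream under this assumption, regardless of channel counts.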
This review was created by AI and reviewed by human editors.