QUICK REVIEW

[Paper Review] SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM

Mingrui Li, Shuhong Liu | arXiv (Cornell University) | Feb 5, 2024

Robotics and Sensor-Based Localization | 8 citations

TL;DR

SGS-SLAM is a semantic dense visual SLAM system that uses 3D Gaussian splatting to jointly optimize appearance, geometry, and 2D semantic priors, enabling real-time rendering, accurate 3D semantic segmentation, and object-level scene editing.

ABSTRACT

We present SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting. It incorporates appearance, geometry, and semantic features through multi-channel optimization, addressing the oversmoothing limitations of neural implicit SLAM systems in high-quality rendering, scene understanding, and object-level geometry. We introduce a unique semantic feature loss that effectively compensates for the shortcomings of traditional depth and color losses in object optimization. Through a semantic-guided keyframe selection strategy, we prevent erroneous reconstructions caused by cumulative errors. Extensive experiments demonstrate that SGS-SLAM delivers state-of-the-art performance in camera pose estimation, map reconstruction, precise semantic segmentation, and object-level geometric accuracy, while ensuring real-time rendering capabilities.

Motivation & Objective

Motivate dense SLAM with explicit Gaussian representations to overcome NeRF-like oversmoothing and enable real-time rendering and object-level editing.
Propose a multi-channel optimization framework that jointly fuses appearance, depth/geometry, and semantic signals through 3D Gaussians.
Introduce a semantic feature loss and semantic-aware keyframe selection to improve map quality and robustness against cumulative errors.
Demonstrate state-of-the-art tracking, mapping, and 3D semantic segmentation on synthetic and real datasets, with real-time rendering.
Showcase downstream capabilities like scene editing by manipulating Gaussian groups tied to semantic labels.

Proposed method

Represent the scene as an explicit 3D Gaussian radiance field with channels for geometry, appearance, and semantics.
Render Gaussians to 2D via differentiable splatting and depth-aware front-to-back alpha compositing (Max volume rendering).
Use a multi-channel loss L_tracking that combines depth, color, and 2D semantic reprojection with silhouette-based visibility masking.
Perform map reconstruction by densifying Gaussians and jointly optimizing geometry, appearance, and semantic channels with a mapping loss combining depth, color (SSIM-based), and semantic color terms.
Introduce a two-level keyframe selection strategy based on geometric overlap and semantic-mIoU differences to stabilize tracking and mapping.
Allow object-level scene manipulation by editing Gaussian groups corresponding to semantic labels without retraining the whole model.
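
The rendering and tracking steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it composites a handful of depth-sorted Gaussians along a single pixel ray and then evaluates a silhouette-masked L1 loss over depth, color, and semantic channels. The function names (`composite_ray`, `tracking_loss`), the silhouette threshold, and the loss weights are all assumptions made for illustration; the review does not state the paper's actual values.

```python
import numpy as np

def composite_ray(alphas, colors, depths, sem_colors):
    """Front-to-back alpha compositing of the Gaussians hit by one pixel ray.

    alphas: (K,) per-Gaussian opacity at this pixel, already in [0, 1]
    colors, sem_colors: (K, 3) appearance and semantic colors
    depths: (K,) depth of each Gaussian center along the ray
    """
    order = np.argsort(depths)              # nearest Gaussian first
    a = alphas[order]
    # Transmittance before Gaussian i: T_i = prod_{j<i} (1 - a_j)
    trans = np.concatenate(([1.0], np.cumprod(1.0 - a)[:-1]))
    w = a * trans                            # compositing weights
    color = w @ colors[order]
    depth = w @ depths[order]
    sem = w @ sem_colors[order]
    silhouette = w.sum()                     # accumulated opacity (visibility)
    return color, depth, sem, silhouette

def tracking_loss(render, observed, silhouette, sil_thresh=0.99,
                  w_depth=1.0, w_color=0.5, w_sem=0.05):
    """Weighted L1 loss over pixels whose silhouette exceeds a threshold.

    render / observed: dicts with "depth" (N,), "color" (N, 3), "sem" (N, 3).
    sil_thresh and the three weights are hypothetical values.
    """
    mask = silhouette > sil_thresh           # only well-observed pixels count

    def l1(key):
        return np.abs(render[key] - observed[key])[mask].sum()

    return w_depth * l1("depth") + w_color * l1("color") + w_sem * l1("sem")
```

Masking by the silhouette means that pixels where the current map has little accumulated opacity (unmapped or poorly reconstructed regions) do not pull on the camera pose estimate.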

Figure 1: The illustration of the proposed SGS-SLAM. It employs 2D inputs encompassing appearance, geometry, and semantic information, leveraging Gaussian Splatting and differentiable rendering for multi-channel parameter optimization. During the mapping process, SGS-SLAM maps the 2D semantic prior

Experimental results

Research questions

RQ1: Can a 3D Gaussian dense representation be optimized with multi-channel supervision to achieve high-fidelity rendering and accurate 3D semantic segmentation?
RQ2: Does incorporating semantic information into keyframe selection improve SLAM robustness and map quality over time?
RQ3: How does semantic-guided optimization affect object-level geometry and downstream scene editing tasks?
RQ4: What are the performance and memory implications of using explicit Gaussian representations for real-time SLAM on synthetic and real-world data?
RQ5: How does SGS-SLAM compare to NeRF-based semantic SLAM approaches in tracking, mapping, and segmentation accuracy?

Key findings

SGS-SLAM achieves state-of-the-art or leading performance on tracking (ATE RMSE) and mapping (Depth L1, PSNR) metrics on Replica/ScanNet-like benchmarks in the paper’s experiments.
The explicit Gaussian representation with multi-channel optimization yields high-fidelity edge preservation and sharp object boundaries, mitigating NeRF oversmoothing.
Incorporating 2D semantic priors as an explicit channel improves 3D semantic segmentation accuracy, with reported gains over NeRF-based semantic SLAM baselines.
Semantic-guided keyframe selection and uncertainty weighting reduce drift and erroneous reconstructions caused by cumulative tracking errors.
Scene editing via Gaussian manipulation (e.g., removing or transforming semantically labeled objects) can be done in real time without retraining, thanks to the decoupled Gaussian representation.
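
The scene-editing finding can be made concrete with a small sketch. Because the map is an explicit set of Gaussians, an edit amounts to selecting the Gaussians that carry a given semantic label and deleting or moving them. The dictionary layout, the integer `label` array, and the function names below are hypothetical; the paper's actual data structures may differ (for example, semantics may be stored as per-Gaussian colors, not integer ids).

```python
import numpy as np

def remove_object(gaussians, label):
    """Return a copy of the map without the Gaussians tagged `label`."""
    keep = gaussians["label"] != label
    return {name: values[keep] for name, values in gaussians.items()}

def transform_object(gaussians, label, R, t):
    """Apply a rigid transform to the centers of Gaussians tagged `label`.

    R: (3, 3) rotation matrix, t: (3,) translation.
    Only the centers are moved here; a full implementation would also
    rotate each Gaussian's orientation or covariance by R.
    """
    selected = gaussians["label"] == label
    edited = {name: values.copy() for name, values in gaussians.items()}
    edited["means"][selected] = edited["means"][selected] @ R.T + t
    return edited

# Usage: a toy map with three Gaussians, two of which share label 1.
scene = {
    "means": np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]),
    "label": np.array([0, 1, 1]),
}
without_obj = remove_object(scene, label=1)
shifted = transform_object(scene, label=1, R=np.eye(3), t=np.array([0.5, 0.0, 0.0]))
```

Both operations are simple array selections, so they touch only the chosen group and need no re-optimization of the remaining Gaussians.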

Figure 2: Qualitative comparison of our method and the baselines for reconstruction across three scenes from the Replica Dataset (Straub et al., 2019), with key details accentuated using colorful boxes. The results demonstrate that our method delivers more high-fidelity and robust reconstructions,

This review was created by AI and reviewed by human editors.