QUICK REVIEW

[Paper Review] SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM

Mingrui Li, Shuhong Liu | arXiv (Cornell University) | Feb 5, 2024
Robotics and Sensor-Based Localization | 8 citations
TL;DR

SGS-SLAM is a semantic dense visual SLAM system that uses 3D Gaussian splatting to jointly optimize appearance, geometry, and 2D semantic priors, enabling real-time rendering, accurate 3D semantic segmentation, and object-level scene editing.

ABSTRACT

We present SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting. It incorporates appearance, geometry, and semantic features through multi-channel optimization, addressing the oversmoothing limitations of neural implicit SLAM systems in high-quality rendering, scene understanding, and object-level geometry. We introduce a unique semantic feature loss that effectively compensates for the shortcomings of traditional depth and color losses in object optimization. Through a semantic-guided keyframe selection strategy, we prevent erroneous reconstructions caused by cumulative errors. Extensive experiments demonstrate that SGS-SLAM delivers state-of-the-art performance in camera pose estimation, map reconstruction, precise semantic segmentation, and object-level geometric accuracy, while ensuring real-time rendering capabilities.

Motivation & Objective

  • Motivate dense SLAM with explicit Gaussian representations to overcome NeRF-like oversmoothing and enable real-time rendering and object-level editing.
  • Propose a multi-channel optimization framework that jointly fuses appearance, depth/geometry, and semantic signals through 3D Gaussians.
  • Introduce a semantic feature loss and semantic-aware keyframe selection to improve map quality and robustness against cumulative errors.
  • Demonstrate state-of-the-art tracking, mapping, and 3D semantic segmentation on synthetic and real datasets, with real-time rendering.
  • Showcase downstream capabilities like scene editing by manipulating Gaussian groups tied to semantic labels.

Proposed method

  • Represent the scene as an explicit 3D Gaussian radiance field with channels for geometry, appearance, and semantics.
  • Render Gaussians to 2D via differentiable splatting and depth-ordered, front-to-back alpha compositing (following Max's volume rendering formulation).
  • Use a multi-channel loss L_tracking that combines depth, color, and 2D semantic reprojection with silhouette-based visibility masking.
  • Perform map reconstruction by densifying Gaussians and jointly optimizing geometry, appearance, and semantic channels with a mapping loss combining depth, color (SSIM-based), and semantic color terms.
  • Introduce a two-level keyframe selection strategy based on geometric overlap and semantic-mIoU differences to stabilize tracking and mapping.
  • Allow object-level scene manipulation by editing Gaussian groups corresponding to semantic labels without retraining the whole model.
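The rendering and tracking-loss bullets above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the per-channel weights, and the 0.99 silhouette threshold are assumptions chosen for illustration. The first function blends depth-sorted Gaussians for a single pixel front to back; the second sums weighted L1 errors only over pixels whose accumulated opacity (silhouette) is high enough to be trusted.

```python
from typing import List, Sequence, Tuple


def composite_front_to_back(
    alphas: Sequence[float],
    values: Sequence[Sequence[float]],
) -> Tuple[List[float], float]:
    """Blend per-Gaussian channel values for one pixel, nearest Gaussian first.

    alphas[i] is the opacity of the i-th Gaussian at this pixel (input must
    already be sorted by depth). values[i] holds that Gaussian's channels,
    e.g. [r, g, b, depth, sem_r, sem_g, sem_b].
    Returns the blended channels and the silhouette (accumulated opacity).
    """
    n_channels = len(values[0])
    blended = [0.0] * n_channels
    transmittance = 1.0  # fraction of the ray not yet blocked
    for alpha, value in zip(alphas, values):
        weight = alpha * transmittance
        for c in range(n_channels):
            blended[c] += weight * value[c]
        transmittance *= 1.0 - alpha
    return blended, 1.0 - transmittance


def masked_multichannel_l1(
    rendered: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    silhouettes: Sequence[float],
    channel_weights: Sequence[float],
    silhouette_threshold: float = 0.99,  # assumed value, not from the source
) -> float:
    """Weighted L1 error summed over pixels with a confident silhouette."""
    total = 0.0
    for pred, gt, sil in zip(rendered, target, silhouettes):
        if sil <= silhouette_threshold:
            continue  # pixel is not well covered by the current map
        total += sum(w * abs(p - g) for w, p, g in zip(channel_weights, pred, gt))
    return total
```

Because every channel is blended with the same weights, adding a semantic channel to the renderer costs only extra entries in each value vector, which is one way to read the "multi-channel" framing above.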
Figure 1: The illustration of the proposed SGS-SLAM. It employs 2D inputs encompassing appearance, geometry, and semantic information, leveraging Gaussian Splatting and differentiable rendering for multi-channel parameter optimization. During the mapping process, SGS-SLAM maps the 2D semantic prior
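The two-level keyframe selection listed under the proposed method could look roughly like the sketch below. The mIoU computation is standard; everything else is an assumption made for illustration, including the function names, the thresholds, and in particular the direction of the semantic test (here, candidates that are semantically near-identical to the current frame are skipped). The review does not specify these details.

```python
from typing import List, Sequence, Tuple


def mean_iou(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Mean intersection-over-union between two flattened label maps."""
    classes = set(labels_a) | set(labels_b)
    if not classes:
        return 0.0
    ious = []
    for cls in classes:
        inter = sum(1 for a, b in zip(labels_a, labels_b) if a == cls and b == cls)
        union = sum(1 for a, b in zip(labels_a, labels_b) if a == cls or b == cls)
        ious.append(inter / union)
    return sum(ious) / len(ious)


def select_keyframes(
    current_labels: Sequence[int],
    candidates: Sequence[Tuple[str, float, Sequence[int]]],
    min_overlap: float = 0.5,  # hypothetical geometric-overlap threshold
    max_miou: float = 0.8,     # hypothetical semantic-similarity threshold
) -> List[str]:
    """Two-stage filter: geometric overlap first, then semantic mIoU.

    Each candidate is (frame_id, geometric_overlap, flattened_label_map).
    """
    selected = []
    for frame_id, overlap, labels in candidates:
        if overlap < min_overlap:
            continue  # stage 1: too little shared geometry
        if mean_iou(current_labels, labels) > max_miou:
            continue  # stage 2 (assumed): semantically redundant view
        selected.append(frame_id)
    return selected
```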

Experimental results

Research questions

  • RQ1: Can a 3D Gaussian dense representation be optimized with multi-channel supervision to achieve high-fidelity rendering and accurate 3D semantic segmentation?
  • RQ2: Does incorporating semantic information into keyframe selection improve SLAM robustness and map quality over time?
  • RQ3: How does semantic-guided optimization affect object-level geometry and downstream scene editing tasks?
  • RQ4: What are the performance and memory implications of using explicit Gaussian representations for real-time SLAM on synthetic and real-world data?
  • RQ5: How does SGS-SLAM compare to NeRF-based semantic SLAM approaches in tracking, mapping, and segmentation accuracy?

Key findings

  • SGS-SLAM reports state-of-the-art tracking (ATE RMSE) and mapping (Depth L1, PSNR) results on Replica and ScanNet-style benchmarks in the paper's experiments.
  • The explicit Gaussian representation with multi-channel optimization yields high-fidelity edge preservation and sharp object boundaries, mitigating NeRF oversmoothing.
  • Incorporating 2D semantic priors as an explicit channel improves 3D semantic segmentation accuracy, with reported gains over NeRF-based semantic SLAM baselines.
  • Semantic-guided keyframe selection and uncertainty weighting reduce drift and erroneous reconstructions caused by cumulative tracking errors.
  • Scene editing via Gaussian manipulation (e.g., removing or transforming semantically labeled objects) can be done in real time without retraining, thanks to the decoupled Gaussian representation.
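The scene-editing finding follows from the map being an explicit set of Gaussians, each carrying a semantic label: editing an object reduces to selecting Gaussians by label. The sketch below is a simplified illustration, not the paper's code. The `Gaussian` record keeps only a center and a label (a real map would also store covariance, opacity, and color), and both function names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Gaussian:
    center: Tuple[float, float, float]
    label: int  # semantic class or object id assigned during mapping


def remove_object(gaussians: Sequence[Gaussian], label: int) -> List[Gaussian]:
    """Drop every Gaussian that belongs to the given semantic label."""
    return [g for g in gaussians if g.label != label]


def translate_object(
    gaussians: Sequence[Gaussian],
    label: int,
    offset: Tuple[float, float, float],
) -> List[Gaussian]:
    """Shift the Gaussians of one label by a fixed offset; leave the rest."""
    moved = []
    for g in gaussians:
        if g.label == label:
            new_center = tuple(c + o for c, o in zip(g.center, offset))
            g = replace(g, center=new_center)
        moved.append(g)
    return moved
```

Neither operation touches the parameters of other Gaussians, which is why no re-optimization is needed before re-rendering the edited scene.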
Figure 2: Qualitative comparison of our method and the baselines for reconstruction across three scenes from the Replica dataset (Straub et al., 2019), with key details accentuated using colorful boxes. The results demonstrate that our method delivers more high-fidelity and robust reconstructions,
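For readers unfamiliar with the metrics named in the key findings, the sketch below shows how ATE RMSE and PSNR are commonly computed. It assumes the estimated trajectory has already been aligned to the ground truth (evaluation toolkits usually perform a rigid alignment first) and that pixel values lie in [0, 1]. The function names are illustrative and are not taken from the paper.

```python
import math
from typing import Sequence


def ate_rmse(
    estimated: Sequence[Sequence[float]],
    ground_truth: Sequence[Sequence[float]],
) -> float:
    """Root-mean-square translational error over pre-aligned camera positions."""
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def psnr(
    rendered: Sequence[float],
    reference: Sequence[float],
    max_value: float = 1.0,
) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((r - t) ** 2 for r, t in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Lower ATE RMSE means better tracking, while higher PSNR means the rendered view is closer to the reference image.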

This review was created by AI and reviewed by human editors.