[Paper Review] SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM
SGS-SLAM is a semantic dense visual SLAM system that uses 3D Gaussian splatting to jointly optimize appearance, geometry, and 2D semantic priors, enabling real-time rendering, accurate 3D semantic segmentation, and object-level scene editing.
We present SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting. It incorporates appearance, geometry, and semantic features through multi-channel optimization, addressing the oversmoothing limitations of neural implicit SLAM systems in high-quality rendering, scene understanding, and object-level geometry. We introduce a unique semantic feature loss that effectively compensates for the shortcomings of traditional depth and color losses in object optimization. Through a semantic-guided keyframe selection strategy, we prevent erroneous reconstructions caused by cumulative errors. Extensive experiments demonstrate that SGS-SLAM delivers state-of-the-art performance in camera pose estimation, map reconstruction, precise semantic segmentation, and object-level geometric accuracy, while ensuring real-time rendering capabilities.
Motivation & Objective
- Motivate dense SLAM with explicit Gaussian representations to overcome NeRF-like oversmoothing and enable real-time rendering and object-level editing.
- Propose a multi-channel optimization framework that jointly fuses appearance, depth/geometry, and semantic signals through 3D Gaussians.
- Introduce a semantic feature loss and semantic-aware keyframe selection to improve map quality and robustness against cumulative errors.
- Demonstrate state-of-the-art tracking, mapping, and 3D semantic segmentation on synthetic and real datasets, with real-time rendering.
- Showcase downstream capabilities like scene editing by manipulating Gaussian groups tied to semantic labels.
Proposed method
- Represent the scene as an explicit 3D Gaussian radiance field with channels for geometry, appearance, and semantics.
- Render Gaussians to 2D via differentiable splatting and depth-aware front-to-back composition (Max volume rendering).
- Use a multi-channel loss L_tracking that combines depth, color, and 2D semantic reprojection with silhouette-based visibility masking.
- Perform map reconstruction by densifying Gaussians and jointly optimizing geometry, appearance, and semantic channels with a mapping loss combining depth, color (SSIM-based), and semantic color terms.
- Introduce a two-level keyframe selection strategy based on geometric overlap and semantic-mIoU differences to stabilize tracking and mapping.
- Allow object-level scene manipulation by editing Gaussian groups corresponding to semantic labels without retraining the whole model.
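The rendering step listed above can be illustrated with a minimal sketch of front-to-back alpha compositing for a single pixel. It is not the authors' implementation: the `Splat` container, its field names, and the choice to encode the semantic channel as a 3-tuple color are assumptions made for illustration. The sketch only shows how one set of blending weights can produce color, semantics, depth, and a silhouette (accumulated opacity) at once.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Splat:
    """One Gaussian's contribution at a single pixel (hypothetical container)."""
    depth: float                          # camera-space depth of the Gaussian centre
    alpha: float                          # opacity at this pixel, in [0, 1]
    color: Tuple[float, float, float]     # appearance channel (RGB)
    semantic: Tuple[float, float, float]  # semantic channel, assumed color-encoded


def composite_pixel(splats: List[Splat]):
    """Blend all channels front to back with shared weights.

    Returns (color, semantic, depth, silhouette). The silhouette is the
    accumulated opacity, i.e. how much of the pixel the map already covers.
    """
    color = [0.0, 0.0, 0.0]
    semantic = [0.0, 0.0, 0.0]
    depth = 0.0
    transmittance = 1.0  # fraction of the ray not yet absorbed

    for s in sorted(splats, key=lambda s: s.depth):  # nearest Gaussian first
        weight = s.alpha * transmittance
        for c in range(3):
            color[c] += weight * s.color[c]
            semantic[c] += weight * s.semantic[c]
        depth += weight * s.depth
        transmittance *= 1.0 - s.alpha

    silhouette = 1.0 - transmittance
    return tuple(color), tuple(semantic), depth, silhouette
```

A silhouette close to 1 indicates that the pixel is well covered by existing Gaussians, which is one plausible way to build the visibility mask mentioned for the tracking loss; pixels with a low silhouette would be candidates for densification during mapping.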

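The two-level keyframe selection described in the method list can be sketched as a pair of filters. This is a hedged illustration only: the `Candidate` container, the threshold values, and the direction of the semantic test (keeping candidates whose mIoU with the current frame is below a ceiling, so that semantically redundant views are dropped) are assumptions, not details confirmed by the review. Label maps are assumed to be flattened and already aligned to the current view.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """A stored keyframe under consideration (hypothetical container)."""
    frame_id: int
    overlap: float          # geometric overlap with the current frame, in [0, 1]
    labels: Sequence[int]   # flattened semantic label map, aligned to the current view


def mean_iou(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean intersection-over-union across all classes present in either map."""
    if len(a) != len(b):
        raise ValueError("label maps must have the same number of pixels")
    classes = set(a) | set(b)
    if not classes:
        return 0.0
    total = 0.0
    for c in classes:
        inter = sum(1 for x, y in zip(a, b) if x == c and y == c)
        union = sum(1 for x, y in zip(a, b) if x == c or y == c)
        total += inter / union  # union > 0 because c occurs in a or b
    return total / len(classes)


def select_keyframes(
    candidates: List[Candidate],
    current_labels: Sequence[int],
    min_overlap: float = 0.3,  # assumed threshold
    max_miou: float = 0.8,     # assumed threshold
) -> List[int]:
    """Level 1: geometric overlap filter. Level 2: semantic mIoU filter."""
    geometric = [c for c in candidates if c.overlap >= min_overlap]
    return [
        c.frame_id
        for c in geometric
        if mean_iou(c.labels, current_labels) <= max_miou
    ]
```

Under these assumptions, a candidate that sees the same region but contributes no new semantic content is skipped, while a candidate with too little geometric overlap never reaches the semantic test.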
Experimental results
Research questions
- RQ1: Can a 3D Gaussian dense representation be optimized with multi-channel supervision to achieve high-fidelity rendering and accurate 3D semantic segmentation?
- RQ2: Does incorporating semantic information into keyframe selection improve SLAM robustness and map quality over time?
- RQ3: How does semantic-guided optimization affect object-level geometry and downstream scene editing tasks?
- RQ4: What are the performance and memory implications of using explicit Gaussian representations for real-time SLAM on synthetic and real-world data?
- RQ5: How does SGS-SLAM compare to NeRF-based semantic SLAM approaches in tracking, mapping, and segmentation accuracy?
Key findings
- SGS-SLAM achieves state-of-the-art or leading performance on tracking (ATE RMSE) and mapping (Depth L1, PSNR) metrics on Replica/ScanNet-like benchmarks in the paper’s experiments.
- The explicit Gaussian representation with multi-channel optimization yields high-fidelity edge preservation and sharp object boundaries, mitigating NeRF oversmoothing.
- Incorporating 2D semantic priors as an explicit channel improves 3D semantic segmentation accuracy, with reported gains over NeRF-based semantic SLAM baselines.
- Semantic-guided keyframe selection and uncertainty weighting reduce drift and erroneous reconstructions caused by cumulative tracking errors.
- Scene editing via Gaussian manipulation (e.g., removing or transforming semantically labeled objects) can be done in real time without retraining, thanks to the decoupled Gaussian representation.
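Because each Gaussian is an explicit primitive, the object-level edits mentioned in the findings reduce to list operations over Gaussians that share a semantic label. The sketch below assumes a simplified `Gaussian` record holding only a center and an integer label; real Gaussians also carry covariance, opacity, and color, and the function names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Gaussian:
    """Simplified Gaussian: position plus semantic label (hypothetical)."""
    center: Vec3
    label: int


def remove_label(gaussians: List[Gaussian], label: int) -> List[Gaussian]:
    """Delete an object by dropping every Gaussian that carries its label."""
    return [g for g in gaussians if g.label != label]


def translate_label(
    gaussians: List[Gaussian], label: int, offset: Vec3
) -> List[Gaussian]:
    """Move an object by shifting the centers of its Gaussians."""
    moved = []
    for g in gaussians:
        if g.label == label:
            new_center = tuple(c + o for c, o in zip(g.center, offset))
            moved.append(replace(g, center=new_center))
        else:
            moved.append(g)
    return moved
```

Neither operation touches the remaining Gaussians, which is consistent with the claim that edits need no retraining; the edited set can simply be rendered again.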

This review was created by AI and reviewed by human editors.