QUICK REVIEW

[Paper Review] SPOT-Occ: Sparse Prototype-guided Transformer for Camera-based 3D Occupancy Prediction

Suzeyu Chen, Leheng Li | arXiv (Cornell University) | Feb 4, 2026

Advanced Neural Network Applications | Cited by 0

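One-Sentence Summary

SPOT-Occ introduces a sparse prototype-guided Transformer decoder that replaces dense cross-attention with two-stage prototype selection and aggregation, supported by a denoising training paradigm, and achieves higher accuracy at substantially lower latency on camera-based 3D occupancy benchmarks.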

ABSTRACT

Achieving highly accurate and real-time 3D occupancy prediction from cameras is a critical requirement for the safe and practical deployment of autonomous vehicles. While this shift to sparse 3D representations solves the encoding bottleneck, it creates a new challenge for the decoder: how to efficiently aggregate information from a sparse, non-uniformly distributed set of voxel features without resorting to computationally prohibitive dense attention. In this paper, we propose a novel Prototype-based Sparse Transformer Decoder that replaces this costly interaction with an efficient, two-stage process of guided feature selection and focused aggregation. Our core idea is to make the decoder's attention prototype-guided. We achieve this through a sparse prototype selection mechanism, where each query adaptively identifies a compact set of the most salient voxel features, termed prototypes, for focused feature aggregation. To ensure this dynamic selection is stable and effective, we introduce a complementary denoising paradigm. This approach leverages ground-truth masks to provide explicit guidance, guaranteeing a consistent query-prototype association across decoder layers. Our model, dubbed SPOT-Occ, outperforms previous methods with a significant margin in speed while also improving accuracy. Source code is released at https://github.com/chensuzeyu/SpotOcc.

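Motivation and Objectives

Motivated by the need for efficient, real-time 3D occupancy prediction from camera data for autonomous driving.
Resolve the decoding bottleneck of sparse 3D representations by focusing attention on a compact set of voxel prototypes.
Propose a two-stage prototype-guided decoding process, stabilized by supervision from denoising training.
Demonstrate higher accuracy and lower latency on the nuScenes-Occupancy and SemanticKITTI benchmarks.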

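Proposed Method

Introduce a sparse prototype-guided Transformer decoder (SPOT-Occ) to replace costly dense cross-attention.
Implement Deformable Top-ρ selection, which picks the top-ρ salient voxel prototypes for each query in each attention head.
Refine each query by prototype-guided aggregation with a gated update.
Apply a denoising head during training to stabilize query-prototype association, with no added inference cost.
Train with a composite loss comprising a matching loss, a denoising loss, and a view-transformation depth loss.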
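
The selection and aggregation steps listed above can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation: the function name `prototype_guided_update`, the gate parameter `w_gate`, the dot-product scoring, and the reading of ρ as a fraction of voxels kept per query are all assumptions made for illustration. For clarity, the sketch scores every query against every voxel before keeping the top entries, which the paper's deformable selection is meant to avoid.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prototype_guided_update(queries, voxels, rho, w_gate):
    """Hypothetical single-head refinement step (illustrative only).

    queries: (Q, C) query embeddings
    voxels:  (N, C) sparse voxel features
    rho:     assumed to be the fraction of voxels kept per query
    w_gate:  (2C, C) assumed gate projection
    """
    num_voxels, dim = voxels.shape
    k = max(1, int(np.ceil(rho * num_voxels)))

    # Similarity of every query to every voxel (dense here for clarity only).
    scores = queries @ voxels.T / np.sqrt(dim)                  # (Q, N)

    # Stage 1: each query keeps its k highest-scoring voxels as prototypes.
    top_idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]    # (Q, k)
    top_scores = np.take_along_axis(scores, top_idx, axis=1)    # (Q, k)

    # Stage 2: attention restricted to the selected prototypes.
    weights = softmax(top_scores, axis=1)                       # (Q, k)
    prototypes = voxels[top_idx]                                # (Q, k, C)
    aggregated = np.einsum("qk,qkc->qc", weights, prototypes)   # (Q, C)

    # Gated update: blend the aggregated feature with the old query.
    gate = sigmoid(np.concatenate([queries, aggregated], axis=1) @ w_gate)
    refined = gate * aggregated + (1.0 - gate) * queries
    return refined, top_idx
```

With `rho=1.0` the sketch reduces to ordinary dense cross-attention followed by the same gate, so lowering `rho` shrinks the aggregation step from N voxels to roughly ρN prototypes per query.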

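Experimental Results

Research Questions

RQ1: Can a sparse, prototype-guided decoder match or exceed the 3D occupancy accuracy of dense or masked-attention decoders?
RQ2: Does the denoising training paradigm stabilize query-prototype association across decoder layers without increasing inference cost?
RQ3: What is the trade-off between prototype ratio and accuracy/latency in sparse cross-attention for 3D occupancy?
RQ4: How does SPOT-Occ compare with state-of-the-art methods on standard camera-based occupancy benchmarks?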

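Key Findings

SPOT-Occ reaches 13.7% mIoU on the nuScenes-Occupancy validation set, ahead of SparseOcc (13.2%) and GaussianFormer-2 (13.4%).
On the nuScenes-Occupancy benchmark, SPOT-Occ cuts inference latency by 57.6% relative to GaussianFormer-2.
On SemanticKITTI, SPOT-Occ reaches 13.27% mIoU, the best among the listed camera-based occupancy methods.
Ablations show that sparse prototype-guided cross-attention (SPOT-CA) raises mIoU while lowering latency, and that denoising training (DN) further stabilizes training.
Combining SPOT-CA with DN gives the best overall ablation result (13.27% mIoU) and lowers latency to 164 ms.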
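
To put the reported 57.6% latency reduction in more familiar terms, the short sketch below converts a fractional reduction into a speedup factor. The helper names are hypothetical, and this summary does not state the absolute latency of GaussianFormer-2, so only the ratio is computed.

```python
def latency_reduction(baseline_ms, new_ms):
    """Fractional reduction of new_ms relative to baseline_ms."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return (baseline_ms - new_ms) / baseline_ms

def speedup_from_reduction(reduction):
    """Speedup factor implied by a fractional latency reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)

# A 57.6% reduction leaves 42.4% of the baseline latency.
print(round(speedup_from_reduction(0.576), 2))  # 2.36
```

In other words, a 57.6% reduction corresponds to running roughly 2.36 times faster than the baseline.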

This review was generated by AI and reviewed by human editors.