QUICK REVIEW

[Paper Review] SPOT-Occ: Sparse Prototype-guided Transformer for Camera-based 3D Occupancy Prediction

Suzeyu Chen, Leheng Li|arXiv (Cornell University)|2026. 02. 04.

Advanced Neural Network Applications | Citations: 0

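One-Line Summary

SPOT-Occ replaces dense cross-attention with a sparse prototype-guided Transformer decoder that performs two-stage prototype selection and aggregation, and it adds a denoising training paradigm; together these achieve higher accuracy and markedly lower latency on camera-based 3D occupancy benchmarks.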

ABSTRACT

Achieving highly accurate and real-time 3D occupancy prediction from cameras is a critical requirement for the safe and practical deployment of autonomous vehicles. While this shift to sparse 3D representations solves the encoding bottleneck, it creates a new challenge for the decoder: how to efficiently aggregate information from a sparse, non-uniformly distributed set of voxel features without resorting to computationally prohibitive dense attention. In this paper, we propose a novel Prototype-based Sparse Transformer Decoder that replaces this costly interaction with an efficient, two-stage process of guided feature selection and focused aggregation. Our core idea is to make the decoder's attention prototype-guided. We achieve this through a sparse prototype selection mechanism, where each query adaptively identifies a compact set of the most salient voxel features, termed prototypes, for focused feature aggregation. To ensure this dynamic selection is stable and effective, we introduce a complementary denoising paradigm. This approach leverages ground-truth masks to provide explicit guidance, guaranteeing a consistent query-prototype association across decoder layers. Our model, dubbed SPOT-Occ, outperforms previous methods with a significant margin in speed while also improving accuracy. Source code is released at https://github.com/chensuzeyu/SpotOcc.

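Motivation and Objectives

Establishes the need for efficient 3D occupancy prediction from camera data for real-time autonomous driving.
Addresses the decoder bottleneck in sparse 3D representations by focusing attention on a compact set of voxel prototypes.
Proposes a two-stage prototype-guided decoding process, with denoising training providing stable supervision.
Demonstrates improved accuracy and reduced latency on the nuScenes-Occupancy and SemanticKITTI benchmarks.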

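Proposed Method

Introduces the Sparse Prototype-guided Transformer Decoder (SPOT-Occ) to replace costly dense cross-attention.
Implements Deformable Top-ρ Selection, which selects the top-ρ most salient voxel prototypes across attention heads for each query.
Refines each query through prototype-guided aggregation with a gated update.
Applies a Denoising Head during training to stabilize query-prototype association, with no inference overhead.
Trains with a composite loss comprising a matching loss, a denoising loss, and a depth loss for the view transformer.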
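
The two-stage idea of selecting a few prototypes and then aggregating only over them can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a single head, plain dot-product scoring instead of deformable selection, and a fixed scalar gate instead of a learned one. It also assumes ρ is a ratio of the voxel count. The names `select_prototypes`, `aggregate`, `rho`, and `gate` are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_prototypes(query, voxels, rho):
    """Stage 1 (assumed form): keep the ceil(rho * N) voxels that score
    highest against the query. Returns their indices and raw scores."""
    k = max(1, math.ceil(rho * len(voxels)))
    scores = [dot(query, v) for v in voxels]
    top = sorted(range(len(voxels)), key=lambda i: scores[i], reverse=True)[:k]
    return top, [scores[i] for i in top]


def aggregate(query, voxels, top, top_scores, gate):
    """Stage 2 (assumed form): attend over the selected prototypes only,
    then blend the attended context into the query with a scalar gate."""
    scale = math.sqrt(len(query))
    weights = softmax([s / scale for s in top_scores])
    context = [
        sum(w * voxels[i][d] for w, i in zip(weights, top))
        for d in range(len(query))
    ]
    return [(1.0 - gate) * q + gate * c for q, c in zip(query, context)]


# Toy example: four 2-D voxel features, one query, half kept as prototypes.
voxels = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
top, top_scores = select_prototypes(query, voxels, rho=0.5)
refined = aggregate(query, voxels, top, top_scores, gate=0.5)
```

Because attention is computed over only ⌈ρN⌉ prototypes per query rather than all N voxels, the attention cost in this sketch scales with ρ, which is the trade-off the review's RQ3 refers to.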

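Experimental Results

Research Questions

RQ1: Can a sparse prototype-guided decoder match or exceed the 3D occupancy accuracy of dense or masked attention decoders?
RQ2: Can the denoising training paradigm stabilize query-prototype association across decoder layers without adding inference cost?
RQ3: What is the trade-off between prototype ratio and accuracy/latency in sparse cross-attention for 3D occupancy?
RQ4: How does SPOT-Occ compare with state-of-the-art methods on standard camera-based occupancy benchmarks?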

Key Results

SPOT-Occ achieves 13.7% mIoU on the nuScenes-Occupancy validation set, surpassing SparseOcc (13.2%) and GaussianFormer-2 (13.4%).
SPOT-Occ reduces inference latency by 57.6% relative to GaussianFormer-2 on the nuScenes-Occupancy benchmark.
On SemanticKITTI, SPOT-Occ achieves 13.27% mIoU, the best among the listed camera-based occupancy methods.
Ablations show that Sparse Prototype-guided Cross-Attention (SPOT-CA) improves mIoU and reduces latency, and that Denoising (DN) training further stabilizes training.
Combining SPOT-CA and DN yields the best overall performance in the ablation study (13.27% mIoU) with reduced latency (164 ms).
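
For readers less familiar with the two metrics quoted above, the following sketch shows one common way to compute mean IoU over voxel labels and how a relative latency reduction converts to a speedup factor. It is an illustration under stated assumptions, not the benchmarks' official evaluation code: classes absent from both prediction and ground truth are skipped, no ignore label is handled, and the function names are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists.
    Classes that appear in neither list are skipped (an assumption;
    benchmark protocols may treat empty classes differently)."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def speedup_from_reduction(reduction):
    """Convert a fractional latency reduction (0.576 for 57.6%) into a
    speedup factor: new latency = (1 - reduction) * old latency."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# Toy labels over four voxels and three classes.
toy_miou = mean_iou([0, 1, 1, 2], [0, 1, 2, 2], num_classes=3)

# A 57.6% latency reduction corresponds to roughly a 2.36x speedup.
reported_speedup = speedup_from_reduction(0.576)
```

Read this way, the reported 57.6% latency reduction means SPOT-Occ runs in about 42.4% of GaussianFormer-2's inference time on nuScenes-Occupancy.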


This review was generated by AI and reviewed by a human editor.