QUICK REVIEW

[論文レビュー] Sparse4D v2: Recurrent Temporal Fusion with Sparse Model

Xuewu Lin, Tianwei Lin|arXiv (Cornell University)|May 23, 2023

Advanced Image Fusion Techniques被引用数 20

ひとこと要約

Sparse4D v2 は、Sparse4D v2 introduces a recurrent temporal fusion module for sparse multi-view 3D detection, reducing temporal fusion from O(T) to O(1) and achieving state-of-the-art results on nuScenes, with Efficient Deformable Aggregation and camera-parameter encoding improving efficiency and robustness.

ABSTRACT

Sparse algorithms offer great flexibility for multi-view temporal perception tasks. In this paper, we present an enhanced version of Sparse4D, in which we improve the temporal fusion module by implementing a recursive form of multi-frame feature sampling. By effectively decoupling image features and structured anchor features, Sparse4D enables a highly efficient transformation of temporal features, thereby facilitating temporal fusion solely through the frame-by-frame transmission of sparse features. The recurrent temporal fusion approach provides two main benefits. Firstly, it reduces the computational complexity of temporal fusion from $O(T)$ to $O(1)$, resulting in significant improvements in inference speed and memory usage. Secondly, it enables the fusion of long-term information, leading to more pronounced performance improvements due to temporal fusion. Our proposed approach, Sparse4Dv2, further enhances the performance of the sparse perception algorithm and achieves state-of-the-art results on the nuScenes 3D detection benchmark. Code will be available at \url{https://github.com/linxuewu/Sparse4D}.

研究の動機と目的

Motivate improved sparse-based perception for autonomous driving by enabling efficient long-term temporal fusion.
Develop a recurrent temporal fusion mechanism that decouples image features from instance states to reduce computation.
Enhance training and robustness through camera-parameter encoding and dense depth supervision.
Improve memory efficiency and inference speed via an optimized deformable aggregation operation.
Demonstrate state-of-the-art performance on nuScenes compared to BEV-based and other sparse methods.

提案手法

Replace multi-frame sampling with a recurrent temporal propagation of instance features across frames.
Decouple image features and instance states using anchor, instance feature, and anchor embedding, with a lightweight anchor encoder Psi for temporal projection.
Propagate anchors through ego-motion to current frame and re-encode their position embeddings for temporal cross-attention in the decoder.
Introduce Efficient Deformable Aggregation (EDA) to fuse multi-view, multi-scale features as a single CUDA operation, reducing memory and increasing speed.
Incorporate camera parameter encoding directly into the view-weight computation to improve robustness to camera variations.
Add dense depth supervision from LiDAR point clouds to stabilize and accelerate training, with a later removal of the depth-reweight module.

実験結果

リサーチクエスチョン

RQ1How can temporal fusion for sparse multi-view 3D detection be made independent of the number of historical frames to improve speed and memory usage?
RQ2Can instance-level temporal propagation with ego-motion-based anchoring replace multi-frame sampling without sacrificing accuracy?
RQ3Does explicitly encoding camera parameters and incorporating dense depth supervision improve robustness and detection performance across viewpoints and scenarios?
RQ4What are the efficiency and memory benefits of a redesigned deformable aggregation operator in sparse temporal fusion?

主な発見

Sparse4Dv2 achieves faster inference and lower memory usage than Sparse4Dv1, with substantial gains across frames (example: FPS and GPU memory improvements on RTX 3090).
The recurrent temporal fusion enables long-term information fusion without increasing the number of anchors, maintaining comparable inference speed to non-temporal models.
Efficient Deformable Aggregation (EDA) reduces training memory by about half and increases training batch size and overall speed, while also boosting inference FPS by about 42%.
Camera parameter encoding improves orientation and overall perception metrics; removing it degrades mAP and mAOE.
Dense depth supervision significantly boosts performance (e.g., reductions in gradient collapse and improved mAP/NDS when enabled).
On nuScenes validation with ResNet50 and 256x704 input, Sparse4Dv2 achieves mAP 0.439 and NDS 0.539, outperforming several BEV-based and sparse baselines; on test, Sparse4Dv2 with VovNet-99 reaches mAP 0.557 and NDS 0.638, indicating SOTA potential.

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。