QUICK REVIEW

[Paper Review] Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection

Zhiwei Ning, Xuanang Gao|arXiv (Cornell University)|Feb 23, 2026

Advanced Neural Network Applications | Cited by: 0

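One-Sentence Summary

Fore-Mamba3D introduces foreground-focused encoding and strengthens 3D object detection with a regional-to-global slide window (RGSW) and semantic-assisted state fusion. It reports state-of-the-art results among LiDAR-only methods on nuScenes and strong performance on KITTI and Waymo.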

ABSTRACT

Linear modeling methods like Mamba have emerged as an effective backbone for the 3D object detection task. However, previous Mamba-based methods utilize the bidirectional encoding for the whole non-empty voxel sequence, which contains abundant useless background information in the scenes. Though directly encoding foreground voxels appears to be a plausible solution, it tends to degrade detection performance. We attribute this to the response attenuation and restricted context representation in the linear modeling for fore-only sequences. To address this problem, we propose a novel backbone, termed Fore-Mamba3D, to focus on the foreground enhancement by modifying the Mamba-based encoder. The foreground voxels are first sampled according to the predicted scores. Considering the response attenuation existing in the interaction of foreground voxels across different instances, we design a regional-to-global slide window (RGSW) to propagate the information from regional split to the entire sequence. Furthermore, a semantic-assisted and state spatial fusion module (SASFMamba) is proposed to enrich contextual representation by enhancing semantic and geometric awareness within the Mamba model. Our method emphasizes foreground-only encoding and alleviates the distance-based and causal dependencies in the linear autoregression model. The superior performance across various benchmarks demonstrates the effectiveness of Fore-Mamba3D in the 3D object detection task.

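Motivation and Objectives

Motivate foreground-centered encoding as a way to reduce background noise in 3D voxel sequences.
Develop a regional-to-global slide window (RGSW) to mitigate response attenuation in autoregressive Mamba.
Strengthen Mamba's contextual representation through semantic-assisted fusion and state spatial fusion (SASFMamba).
Improve detection performance while reducing memory and compute cost.
Validate effectiveness on the nuScenes, KITTI, and Waymo benchmarks.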

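Proposed Method

Predict a foreground score for each voxel along the Hilbert-curve-flattened sequence and sample the top-k voxels as foreground features.
Apply the regional-to-global slide window (RGSW) in the autoregressive Mamba backbone to propagate regional information to the global sequence.
Introduce SASFMamba, which combines SAF (semantic-assisted fusion) and SSF (state spatial fusion) to enrich the semantic and geometric context of the state variables.
Use multi-rotation Hilbert flattening to mitigate regional truncation, and combine the rotated foreground features with the background voxels.
Train with focal losses on the foreground scores and semantic categories, in addition to the detection head's standard L_cls and L_reg losses.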
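
The first step of the method (Hilbert flattening followed by top-k foreground sampling) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a 2D bird's-eye-view grid instead of 3D voxels, takes the foreground scores as given rather than predicting them, and sets k to a fixed fraction of the sequence length. The names hilbert_index, flatten_and_sample, and ratio are hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a 2D Hilbert curve on an n x n grid.

    n must be a power of two. This is the classic iterative conversion,
    used here as a 2D stand-in for the 3D flattening in the paper.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def flatten_and_sample(voxels, scores, n, ratio=0.2):
    """Return indices of the top-scoring voxels, ordered along the Hilbert curve.

    voxels: list of (x, y) grid coordinates of non-empty cells
    scores: foreground probability per voxel (assumed to come from a predictor)
    ratio:  fraction of voxels to keep (assumption: k = round(ratio * N))
    """
    order = sorted(range(len(voxels)), key=lambda i: hilbert_index(n, *voxels[i]))
    k = max(1, round(ratio * len(voxels)))
    top = set(sorted(order, key=lambda i: scores[i], reverse=True)[:k])
    # Keep the curve order so the sampled sequence stays spatially coherent.
    return [i for i in order if i in top]


voxels = [(0, 0), (3, 0), (1, 1), (0, 2), (2, 3)]
scores = [0.1, 0.9, 0.2, 0.8, 0.3]
kept = flatten_and_sample(voxels, scores, n=4, ratio=0.4)
print(kept)  # [3, 1]
```

The two highest-scoring voxels are kept, and they appear in curve order rather than score order, which is what a sequence model consuming the flattened voxels would need.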

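Experimental Results

Research Questions

RQ1: Do foreground-focused encoding and RGSW improve long-range interaction over previous Mamba backbones that encode all voxels?
RQ2: Can SAF and SSF provide meaningful semantic and geometric improvements to the state variables of a linear Mamba backbone?
RQ3: What is the trade-off among sampling ratio, efficiency, and detection accuracy on standard LiDAR benchmarks?
RQ4: How does Fore-Mamba3D compare with state-of-the-art LiDAR-only detectors on the nuScenes, KITTI, and Waymo datasets?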

Key Findings

Method	Venue	mAP	NDS	Car	Truck	Bus	Trailer	C.V.	Ped.	Motor.	Bike	T.C.	Barrier
CenterPoint	CVPR21	59.2	66.5	84.9	57.4	70.7	38.1	16.9	85.1	59.0	42.0	69.8	68.3
TransFusion-L	CVPR22	65.5	70.1	86.9	60.8	73.1	43.4	25.2	87.5	72.9	57.3	77.2	70.3
VoxelNeXt	CVPR23	64.5	70.0	84.6	53.0	64.7	55.8	28.7	85.8	73.2	45.7	79.0	74.6
DSVT	CVPR23	66.4	71.1	87.4	62.6	75.9	42.1	25.3	88.2	74.8	58.7	77.9	71.0
HEDNet	NIPS23	66.7	71.4	87.7	60.6	77.8	50.7	28.9	87.1	74.3	56.8	76.3	66.9
SAFDNet	CVPR24	66.3	71.0	87.6	60.8	78.0	43.5	26.6	87.8	75.5	58.0	75.0	69.7
Voxel-Mamba	NIPS24	67.5	71.9	87.9	62.8	76.8	45.9	24.9	89.3	77.1	58.6	80.1	71.5
LION	NIPS24	68.0	72.1	87.9	64.9	77.6	44.4	28.5	89.6	75.6	59.4	80.8	71.6
Fore-Mamba3D (Ours)	–	68.4	72.3	88.4	65.2	80.3	48.0	28.2	89.3	75.7	57.7	80.0	71.2
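
To make the table easier to read, the following sketch computes, for every column, the margin between Fore-Mamba3D and the best of the other listed methods. The numbers are copied from the table above. The variable names (TABLE, margins, leads) are only illustrative.

```python
COLUMNS = "mAP NDS Car Truck Bus Trailer C.V. Ped. Motor. Bike T.C. Barrier".split()

# Rows copied from the table above (venue column omitted).
TABLE = """
CenterPoint   59.2 66.5 84.9 57.4 70.7 38.1 16.9 85.1 59.0 42.0 69.8 68.3
TransFusion-L 65.5 70.1 86.9 60.8 73.1 43.4 25.2 87.5 72.9 57.3 77.2 70.3
VoxelNeXt     64.5 70.0 84.6 53.0 64.7 55.8 28.7 85.8 73.2 45.7 79.0 74.6
DSVT          66.4 71.1 87.4 62.6 75.9 42.1 25.3 88.2 74.8 58.7 77.9 71.0
HEDNet        66.7 71.4 87.7 60.6 77.8 50.7 28.9 87.1 74.3 56.8 76.3 66.9
SAFDNet       66.3 71.0 87.6 60.8 78.0 43.5 26.6 87.8 75.5 58.0 75.0 69.7
Voxel-Mamba   67.5 71.9 87.9 62.8 76.8 45.9 24.9 89.3 77.1 58.6 80.1 71.5
LION          68.0 72.1 87.9 64.9 77.6 44.4 28.5 89.6 75.6 59.4 80.8 71.6
Fore-Mamba3D  68.4 72.3 88.4 65.2 80.3 48.0 28.2 89.3 75.7 57.7 80.0 71.2
"""

results = {}
for line in TABLE.strip().splitlines():
    name, *values = line.split()
    results[name] = dict(zip(COLUMNS, map(float, values)))

ours = results.pop("Fore-Mamba3D")
best_prior = {c: max(r[c] for r in results.values()) for c in COLUMNS}

# Positive margin: Fore-Mamba3D is ahead of every other listed method.
margins = {c: round(ours[c] - best_prior[c], 1) for c in COLUMNS}
leads = [c for c in COLUMNS if margins[c] > 0]

print(leads)                           # ['mAP', 'NDS', 'Car', 'Truck', 'Bus']
print(margins["mAP"], margins["NDS"])  # 0.4 0.2
```

Read this way, the method leads on both aggregate metrics and on the three large-vehicle classes, while another listed method is ahead on each of the remaining seven classes.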

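Fore-Mamba3D achieves results competitive with the state of the art on nuScenes and KITTI. On nuScenes it reaches 68.4 mAP and 72.3 NDS on the val set, and 70.1 mAP and 74.0 NDS on the test set.
On KITTI, Fore-Mamba3D achieves the best performance among the compared backbones, improving on the second-best method on average.
On Waymo (trained on a subset), Fore-Mamba3D reaches 72.2–75.6 AP/APH across L1/L2 for vehicle, pedestrian, and cyclist, surpassing several baselines such as CenterPoint at L2.
Ablations show that Hilbert flattening with rotation, RGSW, SAF, and SSF yield cumulative gains when combined, and that an SAF kernel size of K=7 gives the best trade-off between accuracy and efficiency.
Foreground sampling with alpha ≈ 0.2 gives the best balance between accuracy and FLOPs. In a single-GPU test it cuts FLOPs by 43.7% and raises FPS by 23.9% relative to LION.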
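
The two efficiency percentages relative to LION can be converted into ratios with a short calculation. The sketch below uses only the reported percentages. No absolute FLOPs or FPS values are given in this review, so none are assumed. The helper name speed_cost_summary is hypothetical.

```python
def speed_cost_summary(flops_reduction_pct, fps_gain_pct):
    """Convert the reported percentages into ratios relative to the baseline."""
    flops_ratio = 1.0 - flops_reduction_pct / 100.0  # fraction of baseline FLOPs remaining
    fps_ratio = 1.0 + fps_gain_pct / 100.0           # throughput multiplier
    latency_ratio = 1.0 / fps_ratio                  # per-frame time multiplier
    return flops_ratio, fps_ratio, latency_ratio


flops_ratio, fps_ratio, latency_ratio = speed_cost_summary(43.7, 23.9)
latency_cut_pct = round((1.0 - latency_ratio) * 100.0, 1)

print(round(flops_ratio, 3))  # 0.563
print(latency_cut_pct)        # 19.3
```

A 23.9% gain in FPS corresponds to roughly 19.3% less time per frame, which is smaller than the 43.7% FLOPs saving. One plausible reading is that runtime also depends on factors other than backbone FLOPs, such as memory access and the remaining stages of the detector.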

This review was generated by AI and checked by a human editor.