[Paper Review] BEV-Seg: Bird's Eye View Semantic Segmentation Using Geometry and Semantic Point Cloud
BEV-Seg introduces a two-stage pipeline that builds a semantic point cloud from monocular depth and semantic segmentation, then projects it into BEV to produce the final segmentation, achieving state-of-the-art results and improved transferability.
Bird's-eye-view (BEV) is a powerful and widely adopted representation for road scenes that captures surrounding objects and their spatial locations, along with overall context in the scene. In this work, we focus on bird's eye semantic segmentation, a task that predicts pixel-wise semantic segmentation in BEV from side RGB images. This task is made possible by simulators such as Carla, which allow for cheap data collection, arbitrary camera placements, and supervision in ways otherwise not possible in the real world. There are two main challenges to this task: the view transformation from side view to bird's eye view, as well as transfer learning to unseen domains. Existing work transforms between views through fully connected layers and transfer learns via GANs. This suffers from a lack of depth reasoning and performance degradation across domains. Our novel 2-staged perception pipeline explicitly predicts pixel depths and combines them with pixel semantics in an efficient manner, allowing the model to leverage depth information to infer objects' spatial locations in the BEV. In addition, we transfer learning by abstracting high-level geometric features and predicting an intermediate representation that is common across different domains. We publish a new dataset called BEVSEG-Carla and show that our approach improves state-of-the-art by 24% mIoU and performs well when transferred to a new domain.
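Motivation and Objectives
- Achieve robust BEV semantic segmentation from RGB cameras alone, without LiDAR.
- Use depth reasoning and geometry in the side view to perform the cross-view transformation into BEV.
- Improve transfer learning across domains by abstracting an intermediate representation.
- Provide a CARLA-based dataset covering diverse weather conditions and environments.
- Demonstrate state-of-the-art performance and transfer gains over prior methods.
Proposed Method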
- Two-stage pipeline: stage 1 creates a semantic point cloud by fusing side-view semantic maps and monocular depth for multiple views using pin-hole camera geometry.
- Project the semantic point cloud to an incomplete BEV through orthographic projection with height-based conflict resolution.
- Stage 2 uses a parser network to transform the incomplete BEV into a full BEV semantic segmentation via an expanded one-hot representation.
- Depth and segmentation modules are trained with ground-truths in the side view (depth from LiDAR projection, segmentation labels).
- Stage 2 operates on a common intermediate representation to improve transferability across domains.
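The geometric part of stage 1 can be sketched in a few lines. The following is a minimal, single-camera illustration under stated assumptions, not the authors' implementation: the function names `backproject` and `project_to_bev`, the intrinsics `fx, fy, cx, cy`, the grid parameters, and the `unknown` label are all hypothetical. The review only says conflicts are resolved by height, so the rule that the highest point wins a cell is an assumption (the `prefer_highest` flag flips it). Fusing several views would additionally require transforming each camera's points into a common frame, which is omitted here.

```python
from typing import List, Optional, Sequence, Tuple

# (x_right, y_down, z_forward, class_id) in the camera frame
Point = Tuple[float, float, float, int]


def backproject(
    depth: Sequence[Sequence[float]],
    semantics: Sequence[Sequence[int]],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Point]:
    """Lift each pixel to a labeled 3D point using the pinhole camera model."""
    points: List[Point] = []
    for v, (depth_row, sem_row) in enumerate(zip(depth, semantics)):
        for u, (z, cls) in enumerate(zip(depth_row, sem_row)):
            if z <= 0:  # treat non-positive depth as invalid
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, cls))
    return points


def project_to_bev(
    points: Sequence[Point],
    x_range: Tuple[float, float],
    z_range: Tuple[float, float],
    cell_size: float,
    unknown: int = -1,
    prefer_highest: bool = True,  # assumed conflict rule, see lead-in
) -> List[List[int]]:
    """Orthographic top-down projection onto a grid of class ids.

    Cells that receive no point keep the `unknown` label, which is why
    the resulting BEV map is incomplete.
    """
    n_cols = int(round((x_range[1] - x_range[0]) / cell_size))
    n_rows = int(round((z_range[1] - z_range[0]) / cell_size))
    bev = [[unknown] * n_cols for _ in range(n_rows)]
    best_height: List[List[Optional[float]]] = [
        [None] * n_cols for _ in range(n_rows)
    ]
    for x, y, z, cls in points:
        col = int((x - x_range[0]) // cell_size)
        row = int((z - z_range[0]) // cell_size)
        if not (0 <= row < n_rows and 0 <= col < n_cols):
            continue  # point falls outside the BEV grid
        height = -y  # camera y axis points down, so height is -y
        best = best_height[row][col]
        wins = best is None or (
            height > best if prefer_highest else height < best
        )
        if wins:
            best_height[row][col] = height
            bev[row][col] = cls
    return bev
```

Cells left at `unknown` are the gaps that the stage 2 parser network is described as filling in.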
Experimental Results
Research Questions
- RQ1: Can explicit depth reasoning and geometric projection improve BEV semantic segmentation from RGB images?
- RQ2: Does a modular two-stage pipeline enhance transfer learning across different CARLA environments compared to end-to-end methods?
- RQ3: What is the impact of using an intermediate representation on BEV segmentation quality and domain transfer?
- RQ4: How does BEV-Seg perform on a newly released BEV dataset (BEVSEG-Carla) relative to prior methods?
Key Findings
| Model | Source Domain mIoU | Target Domain after Transfer Learning mIoU |
|---|---|---|
| VPN | 36.4% | 27.8% |
| ours (BEV-Seg full) | 60.4% | 44.5% |
| ours - Segmentation Oracle | 60.8% | - |
| ours - Depth Oracle | 66.5% | - |
| ours - Depth & Segmentation Oracle | 67.3% | - |
- BEV-Seg improves mIoU from 36.4% (VPN baseline) to 60.4% on BEVSEG-Carla in the source domain.
- Transfer from clear noon to wet sunset yields 44.5% mIoU for BEV-Seg versus 27.8% for the VPN baseline.
- Per-class IoU shows BEV-Seg better captures pedestrians, road lines, lanes, signs, and smaller objects compared to VPN.
- Oracle variants (ground truth depth/segmentation) indicate depth and segmentation accuracy are critical for peak BEV performance, with depth oracle reaching 66.5% and combined depth+segmentation oracle at 67.3%.
- The modular intermediate representation significantly reduces domain gap, enabling effective transfer without retraining stage 2.
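For readers less familiar with the metric, mIoU averages the per-class intersection-over-union between a predicted and a ground-truth label map. The sketch below is a generic illustration, not the paper's evaluation code: the function name `mean_iou` is hypothetical, labels are assumed to lie in `[0, num_classes)`, and skipping classes that appear in neither map is an assumed convention that the review does not specify.

```python
from typing import List, Sequence


def mean_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
) -> float:
    """Mean intersection-over-union across classes for two label maps."""
    intersection: List[int] = [0] * num_classes
    union: List[int] = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                intersection[t] += 1
                union[t] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    # Assumed convention: classes absent from both maps are skipped.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

Read this way, the table's source-domain numbers (60.4% versus 36.4%) differ by 24.0 mIoU points, and the transfer numbers (44.5% versus 27.8%) differ by 16.7 points.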
This review was generated by AI and checked by a human editor.