QUICK REVIEW

[Paper Review] Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

Yuanhui Huang, Wenzhao Zheng|arXiv (Cornell University)|Feb 15, 2023

Advanced Vision and Imaging|Cited by 8

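One-Line Summary

TPVFormer introduces the tri-perspective view (TPV), a representation built from three planes (top, side, and front), to lift RGB images to 3D semantic occupancy, and it achieves competitive LiDAR segmentation performance using cameras only.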

ABSTRACT

Modern methods for vision-centric autonomous driving perception widely adopt the bird's-eye-view (BEV) representation to describe a 3D scene. Despite its better efficiency than voxel representation, it has difficulty describing the fine-grained 3D structure of a scene with a single plane. To address this, we propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes. We model each point in the 3D space by summing its projected features on the three planes. To lift image features to the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) to obtain the TPV features effectively. We employ the attention mechanism to aggregate the image features corresponding to each query in each TPV plane. Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels. We demonstrate for the first time that using only camera inputs can achieve comparable performance with LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code: https://github.com/wzzheng/TPVFormer.

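Motivation and Objectives

Motivate vision-based 3D perception that targets full 3D semantic occupancy prediction as an alternative to LiDAR.
Propose the TPV representation, which preserves 3D structure using three orthogonal planes (top, side, and front).
Develop TPVFormer, a transformer-based encoder that lifts 2D image features into TPV space via attention.
Show that a camera-only TPVFormer can perform both the LiDAR segmentation and semantic scene completion tasks.
Show that TPVFormer is comparable to LiDAR-based results on LiDAR segmentation and predicts occupancy effectively under sparse supervision.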

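Proposed Method

Define the TPV representation as three planes, T^HW, T^DH, and T^WD, covering the top, side, and front views.
Project each 3D point onto the three TPV planes, sample a feature from each plane by bilinear interpolation, and sum the samples to obtain the point feature f_{x,y,z}.
Lift image features into TPV space with image cross-attention (ICA), implemented as deformable attention over the TPV queries.
Apply cross-view hybrid-attention (CVHA) to enable interaction among the TPV planes.
Initialize the TPV queries as learnable parameters with 3D positional embeddings, and stack HCAB and HAB transformer blocks to form TPVFormer.
Convert the TPV features into point or voxel features and apply a lightweight two-layer MLP for semantic segmentation.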
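The sampling and classification steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the pairing of grid axes with planes ((h, w) for the top plane, (d, h) for the side plane, (w, d) for the front plane), the clamping at plane borders, the ReLU activation, the hidden width, and the random weights are all assumptions made for the example.

```python
import numpy as np


def bilinear_sample(plane, u, v):
    """Bilinearly sample an (A, B, C) feature plane at continuous coords (u, v).

    u indexes the first axis and v the second, both in grid-cell units.
    Clamping coordinates to the plane is an assumption of this sketch.
    """
    a_max, b_max = plane.shape[0] - 1, plane.shape[1] - 1
    u = float(np.clip(u, 0.0, a_max))
    v = float(np.clip(v, 0.0, b_max))
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, a_max), min(v0 + 1, b_max)
    du, dv = u - u0, v - v0
    # Interpolate along the second axis, then along the first.
    near = (1 - dv) * plane[u0, v0] + dv * plane[u0, v1]
    far = (1 - dv) * plane[u1, v0] + dv * plane[u1, v1]
    return (1 - du) * near + du * far


def tpv_point_feature(t_hw, t_dh, t_wd, h, w, d):
    """Sum the features sampled from the three planes at grid coords (h, w, d)."""
    f_top = bilinear_sample(t_hw, h, w)    # top plane, indexed by (h, w)
    f_side = bilinear_sample(t_dh, d, h)   # side plane, indexed by (d, h)
    f_front = bilinear_sample(t_wd, w, d)  # front plane, indexed by (w, d)
    return f_top + f_side + f_front


def mlp_head(feature, w1, b1, w2, b2):
    """Two-layer MLP (ReLU assumed) mapping a point feature to class logits."""
    hidden = np.maximum(feature @ w1 + b1, 0.0)
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
H, W, D, C, NUM_CLASSES = 8, 8, 4, 16, 5  # toy sizes, not the paper's settings
t_hw = rng.normal(size=(H, W, C))
t_dh = rng.normal(size=(D, H, C))
t_wd = rng.normal(size=(W, D, C))

feature = tpv_point_feature(t_hw, t_dh, t_wd, h=2.5, w=3.25, d=1.0)
w1, b1 = rng.normal(size=(C, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, NUM_CLASSES)), np.zeros(NUM_CLASSES)
logits = mlp_head(feature, w1, b1, w2, b2)
print(feature.shape, logits.shape)
```

Because every point feature is read from the same three planes, the same routine can serve sparse LiDAR points and dense voxel centers alike. Only the query coordinates change.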

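Experimental Results

Research Questions

RQ1: Can the tri-perspective TPV representation capture finer 3D structure than BEV while remaining efficient?
RQ2: How well can the transformer-based TPVFormer lift multi-view RGB features into the 3D TPV space when trained with sparse LiDAR supervision?
RQ3: Is a vision-only TPVFormer competitive with LiDAR-based methods on the LiDAR segmentation and semantic scene completion tasks?
RQ4: Which architectural choices (e.g., HCAB vs. HAB blocks, resolution) maximize 3D occupancy prediction under sparse supervision?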

Key Findings

Using only RGB inputs, TPVFormer achieves mIoU comparable to LiDAR-based methods on nuScenes LiDAR segmentation.
The TPV representation combines context from its three planes and preserves fine-grained 3D structure with O(HW+DH+WD) storage.
Increasing the TPV plane resolution at test time lets the model capture more detailed object shapes.
TPVFormer-Small and TPVFormer-Base deliver strong performance with far fewer parameters and FLOPs than MonoScene.
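The method predicts dense semantic occupancy and produces consistent occupancy results on validation data, at times appearing more complete than the ground-truth LiDAR segmentation.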
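To make the O(HW+DH+WD) storage figure concrete, the short sketch below counts feature cells for three TPV planes and for a dense voxel grid of the same extent, which needs O(HWD) cells. The helper names and the grid sizes are illustrative choices for this example and are not taken from the review.

```python
def tpv_cells(h, w, d):
    """Number of feature cells stored by three TPV planes: HW + DH + WD."""
    return h * w + d * h + w * d


def voxel_cells(h, w, d):
    """Number of feature cells stored by a dense voxel grid: HWD."""
    return h * w * d


# Illustrative grid sizes chosen for this example.
for h, w, d in [(100, 100, 8), (200, 200, 16)]:
    tpv, vox = tpv_cells(h, w, d), voxel_cells(h, w, d)
    print(f"{h}x{w}x{d}: TPV={tpv}, voxel={vox}, ratio={vox / tpv:.1f}x")
```

For a 100 x 100 x 8 grid this gives 11,600 plane cells against 80,000 voxel cells, and the gap widens as the resolution grows because the plane count scales with pairwise products rather than the full triple product.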

This review was generated by AI and checked by a human editor.