QUICK REVIEW

[Paper Explained] Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

Yuanhui Huang, Wenzhao Zheng|arXiv (Cornell University)|Feb 15, 2023

Advanced Vision and Imaging | Cited by 8

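One-Sentence Summary

TPVFormer introduces a tri-perspective view (TPV) representation with top, side, and front planes that lifts RGB images to 3D semantic occupancy, achieving performance competitive with LiDAR-based segmentation using cameras only.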

ABSTRACT

Modern methods for vision-centric autonomous driving perception widely adopt the bird's-eye-view (BEV) representation to describe a 3D scene. Despite its better efficiency than voxel representation, it has difficulty describing the fine-grained 3D structure of a scene with a single plane. To address this, we propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes. We model each point in the 3D space by summing its projected features on the three planes. To lift image features to the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) to obtain the TPV features effectively. We employ the attention mechanism to aggregate the image features corresponding to each query in each TPV plane. Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels. We demonstrate for the first time that using only camera inputs can achieve comparable performance with LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code: https://github.com/wzzheng/TPVFormer.

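Motivation and Objectives

Motivate vision-based 3D perception as an alternative to LiDAR for 3D semantic occupancy prediction.
Propose the TPV representation, which preserves 3D structure using three orthogonal planes (top, side, front).
Develop TPVFormer, a Transformer-based encoder that lifts 2D image features to TPV space via attention.
Show that a camera-only TPVFormer can perform LiDAR segmentation and semantic scene completion.
Show that TPVFormer, trained with sparse supervision, achieves results comparable to LiDAR-based methods and improves occupancy prediction.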

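Proposed Method

Define the tri-perspective view (TPV) as three planes, T^HW, T^DH, and T^WD, covering the top, side, and front views.
Project each 3D point onto every TPV plane, sample the features by bilinear interpolation, and sum them to obtain the point feature f_{x,y,z}.
Use image cross-attention (ICA), a deformable attention over TPV queries, to lift image features to TPV space.
Use cross-view hybrid attention (CVHA) to allow interaction among the TPV planes.
Initialize the TPV queries as learnable parameters with 3D positional embeddings; TPVFormer stacks hybrid-cross-attention blocks (HCAB) and hybrid-attention blocks (HAB).
Convert TPV features into point/voxel features and apply a lightweight two-layer MLP for semantic segmentation.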
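
The projection, sampling, and summation steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names (`bilinear_sample`, `tpv_point_feature`, `make_plane`), the nested-list plane layout, and the toy plane values are all assumptions made for the example, and the point coordinates are assumed to be already expressed in plane grid units.

```python
import math

def bilinear_sample(plane, u, v):
    """Bilinearly sample a feature vector from plane[row][col] at continuous (u, v).

    plane is a list of rows; each entry is a C-dimensional feature list.
    u indexes rows and v indexes columns; both are clamped to the plane bounds.
    """
    rows, cols = len(plane), len(plane[0])
    u = min(max(u, 0.0), rows - 1.0)
    v = min(max(v, 0.0), cols - 1.0)
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, rows - 1), min(v0 + 1, cols - 1)
    du, dv = u - u0, v - v0
    channels = len(plane[0][0])
    return [
        (1 - du) * (1 - dv) * plane[u0][v0][c]
        + (1 - du) * dv * plane[u0][v1][c]
        + du * (1 - dv) * plane[u1][v0][c]
        + du * dv * plane[u1][v1][c]
        for c in range(channels)
    ]

def tpv_point_feature(t_hw, t_dh, t_wd, h, w, d):
    """Sum the features sampled from the three planes for a point at (h, w, d)."""
    f_hw = bilinear_sample(t_hw, h, w)  # top plane, indexed by (h, w)
    f_dh = bilinear_sample(t_dh, d, h)  # side plane, indexed by (d, h)
    f_wd = bilinear_sample(t_wd, w, d)  # front plane, indexed by (w, d)
    return [a + b + c for a, b, c in zip(f_hw, f_dh, f_wd)]

def make_plane(rows, cols, fn):
    """Build a toy plane whose feature at (r, c) is fn(r, c). Illustration only."""
    return [[fn(r, c) for c in range(cols)] for r in range(rows)]

# Toy planes with H = W = D = 2 and a single channel (hypothetical values).
t_hw = make_plane(2, 2, lambda h, w: [float(h + w)])
t_dh = make_plane(2, 2, lambda d, h: [float(10 * d)])
t_wd = make_plane(2, 2, lambda w, d: [float(100 * w)])

feature = tpv_point_feature(t_hw, t_dh, t_wd, h=0.5, w=0.5, d=1.0)
print(feature)
```

In a real pipeline the summed feature would then be passed to the segmentation head; a tensor library's batched bilinear sampling would normally replace the per-point loop shown here.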

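Experimental Results

Research Questions

RQ1: Can the tri-perspective TPV representation capture fine-grained 3D structure better than BEV while remaining efficient?
RQ2: How well can the Transformer-based TPVFormer lift multi-view RGB features to 3D TPV space when trained with sparse LiDAR supervision?
RQ3: Is a vision-only TPVFormer competitive with LiDAR-based methods on LiDAR segmentation and semantic scene completion?
RQ4: Which architectural choices (e.g., HCAB and HAB blocks, resolution) maximize 3D occupancy prediction under sparse supervision?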

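Key Findings

Using only RGB images as input, TPVFormer reaches an mIoU comparable to LiDAR-based methods on the nuScenes LiDAR segmentation task.
The TPV representation preserves fine-grained 3D structure across three planes with O(HW+DH+WD) storage and captures more diverse context than a single BEV plane.
Raising the TPV plane resolution at test time lets the model capture finer object shapes.
TPVFormer-Small and TPVFormer-Base deliver strong performance with markedly fewer parameters and FLOPs than MonoScene.
The method predicts dense semantic occupancy that is consistent on validation data, and at times appears to surpass the sparse ground-truth LiDAR segmentation.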
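
To make the O(HW+DH+WD) storage claim concrete, the sketch below counts feature cells for a single BEV plane, the three TPV planes, and a full voxel grid. The function names and the grid size (H=100, W=100, D=8) are hypothetical choices for illustration and are not taken from the experiments reported here.

```python
def bev_cells(h, w):
    """Cells in a single bird's-eye-view plane: O(HW)."""
    return h * w

def tpv_cells(h, w, d):
    """Cells in the three TPV planes: O(HW + DH + WD)."""
    return h * w + d * h + w * d

def voxel_cells(h, w, d):
    """Cells in a dense voxel grid: O(HWD)."""
    return h * w * d

# Hypothetical grid size, chosen only to illustrate the scaling.
H, W, D = 100, 100, 8

counts = {
    "bev": bev_cells(H, W),
    "tpv": tpv_cells(H, W, D),
    "voxel": voxel_cells(H, W, D),
}
for name, n in counts.items():
    print(f"{name}: {n} cells")
```

Under these assumed dimensions, TPV adds only the two extra planes on top of BEV, while the dense voxel grid needs several times more cells; the gap widens as D grows.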


This explainer was generated by AI and reviewed by a human editor.