QUICK REVIEW

[Paper Review] CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers

Runsheng Xu, Zhengzhong Tu|arXiv (Cornell University)|Jul 5, 2022

Advanced Neural Network Applications | Cited by 78

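One-Sentence Summary

CoBEVT introduces a generic multi-agent, multi-camera framework for cooperative BEV semantic segmentation built on a fused axial attention (FAX) sparse Transformer, reaches state-of-the-art results on OPV2V, and generalizes to single-agent BEV segmentation and multi-agent LiDAR tasks.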

ABSTRACT

Bird's eye view (BEV) semantic segmentation plays a crucial role in spatial sensing for autonomous driving. Although recent literature has made significant progress on BEV map understanding, they are all based on single-agent camera-based systems. These solutions sometimes have difficulty handling occlusions or detecting distant objects in complex traffic scenes. Vehicle-to-Vehicle (V2V) communication technologies have enabled autonomous vehicles to share sensing information, dramatically improving the perception performance and range compared to single-agent systems. In this paper, we propose CoBEVT, the first generic multi-agent multi-camera perception framework that can cooperatively generate BEV map predictions. To efficiently fuse camera features from multi-view and multi-agent data in an underlying Transformer architecture, we design a fused axial attention module (FAX), which captures sparsely local and global spatial interactions across views and agents. The extensive experiments on the V2V perception dataset, OPV2V, demonstrate that CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation. Moreover, CoBEVT is shown to be generalizable to other tasks, including 1) BEV segmentation with single-agent multi-camera and 2) 3D object detection with multi-agent LiDAR systems, achieving state-of-the-art performance with real-time inference speed. The code is available at https://github.com/DerrickXuNu/CoBEVT.

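Motivation and Objectives

Advance cooperative perception to overcome the occlusion and limited-depth problems of single-agent BEV systems.
Develop a generic Transformer-based framework that fuses multi-view, multi-agent camera features for BEV segmentation.
Design a memory- and compute-efficient fusion module that fits V2V communication constraints.
Demonstrate generalization to single-agent BEV segmentation and to multi-agent LiDAR-based 3D detection.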

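Proposed Method

Propose SinBEVT, which computes high-resolution BEV features from each agent's multi-view camera images.
Introduce FuseBEVT, a 3D fused axial attention (FAX) Transformer for multi-agent BEV feature fusion that combines local (3D window) attention with sparse global attention.
Provide FAX-SA (self-attention) and FAX-CA (cross-attention) variants of FAX for different perception settings.
Use a lightweight 1x1 autoencoder to compress BEV features before V2V broadcast; the receiver then applies a differentiable geometric warp.
Use BEV embeddings as queries over high-resolution camera features, with camera-aware positional encoding to learn geometric correspondences.
Train the whole CoBEVT pipeline end to end with Adam, cosine annealing, and a weighted cross-entropy loss.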
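
The local and sparse global attention in FAX can be illustrated by how tokens are grouped before attention is applied. The sketch below is not the authors' implementation: it assumes a block/grid style of partitioning (non-overlapping windows for local attention, strided sampling for sparse global attention), which this summary does not spell out. The function names, the 8x8 map, and the window and grid sizes are hypothetical choices for illustration only.

```python
from collections import defaultdict


def local_window_groups(n_agents, height, width, window):
    """Group (agent, y, x) tokens into non-overlapping 3D local windows.

    Tokens from all agents that fall in the same window x window patch
    are placed in one group and would attend only to each other.
    """
    assert height % window == 0 and width % window == 0
    groups = defaultdict(list)
    for agent in range(n_agents):
        for y in range(height):
            for x in range(width):
                groups[(y // window, x // window)].append((agent, y, x))
    return dict(groups)


def sparse_global_groups(n_agents, height, width, grid):
    """Group tokens into sparse sets of grid x grid strided positions.

    Each group samples the whole map with a fixed stride, so attention
    inside a group reaches across the map but touches few tokens.
    (Assumed partitioning scheme, not confirmed by the summary.)
    """
    assert height % grid == 0 and width % grid == 0
    stride_y, stride_x = height // grid, width // grid
    groups = defaultdict(list)
    for agent in range(n_agents):
        for y in range(height):
            for x in range(width):
                groups[(y % stride_y, x % stride_x)].append((agent, y, x))
    return dict(groups)


def attention_pairs(groups):
    """Count query-key pairs when attention is restricted to each group."""
    return sum(len(tokens) ** 2 for tokens in groups.values())


local = local_window_groups(n_agents=2, height=8, width=8, window=4)
sparse = sparse_global_groups(n_agents=2, height=8, width=8, grid=4)
full_pairs = (2 * 8 * 8) ** 2  # dense attention over every token
print(attention_pairs(local), attention_pairs(sparse), full_pairs)
# prints: 4096 4096 16384
```

In this toy setting each partition needs a quarter of the query-key pairs of dense attention. The local groups cover contiguous patches, while the sparse groups span the full map.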

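Experimental Results

Research Questions

RQ1: Can multi-agent, multi-camera BEV segmentation outperform single-agent multi-camera methods in occluded or long-range scenes?
RQ2: Can sparse fused axial attention (FAX) aggregate BEV features across agents and views at a manageable computational cost?
RQ3: How well does cooperative BEV fusion generalize to single-agent BEV tasks and to LiDAR-based 3D detection?
RQ4: How do feature compression and the number of collaborating agents affect performance and latency?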

Key Findings

Method	Vehicle IoU	Drivable area IoU	Lane IoU
No Fusion	37.7	57.8	43.7
Map Fusion	45.1	60.0	44.1
F-Cooper	52.5	60.4	46.5
AttFuse	51.9	60.5	46.2
V2VNet	53.5	60.2	47.5
DiscoNet	52.9	60.7	45.8
FuseBEVT	59.0	62.1	49.2
CoBEVT	60.4	63.0	53.0
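
The per-class margins over the baselines can be read directly off the table. The short sketch below recomputes them; the IoU values are copied from the table, while the helper name `margins_over_best_baseline` and the split into baselines and proposed methods are illustrative assumptions.

```python
# IoU values copied from the table: (vehicle, drivable area, lane).
baselines = {
    "No Fusion": (37.7, 57.8, 43.7),
    "Map Fusion": (45.1, 60.0, 44.1),
    "F-Cooper": (52.5, 60.4, 46.5),
    "AttFuse": (51.9, 60.5, 46.2),
    "V2VNet": (53.5, 60.2, 47.5),
    "DiscoNet": (52.9, 60.7, 45.8),
}
proposed = {
    "FuseBEVT": (59.0, 62.1, 49.2),
    "CoBEVT": (60.4, 63.0, 53.0),
}
classes = ("vehicle", "drivable area", "lane")


def margins_over_best_baseline(scores):
    """Return {class: (best baseline, margin in IoU points)}."""
    result = {}
    for idx, name in enumerate(classes):
        # Strongest baseline for this class only.
        best = max(baselines, key=lambda method: baselines[method][idx])
        # Round to one decimal to hide floating-point noise.
        margin = round(scores[idx] - baselines[best][idx], 1)
        result[name] = (best, margin)
    return result


for method, scores in proposed.items():
    print(method, margins_over_best_baseline(scores))
```

Comparing against the best baseline per class, rather than against one fixed method, keeps the reported margin conservative for every column.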

On the OPV2V camera track, CoBEVT reaches a vehicle IoU of 60.4, a drivable-area IoU of 63.0, and a lane IoU of 53.0, outperforming all baselines.
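FuseBEVT alone already beats the strongest baseline in every class. By the table above, the margins are 5.5 IoU points for vehicles (over V2VNet), 1.4 for drivable area (over DiscoNet), and 1.7 for lanes (over V2VNet).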
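Replacing CVT with SinBEVT for feature extraction yields a further gain of up to 3.8 IoU points across classes.
On the OPV2V LiDAR track, CoBEVT-based fusion reaches an AP of 85.2 at IoU 0.7, surpassing previous methods, and remains robust under 64x feature compression (AP 84.9).
On nuScenes vehicle map-view segmentation, SinBEVT reaches 37.1 IoU at 35 FPS on an RTX 2080, showing real-time performance with competitive accuracy.
Ablation studies show that both the local and the global attention in FAX contribute substantially to performance, and that CoBEVT remains beneficial when some cameras or agents are dropped.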
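
Every segmentation result above is reported as IoU (intersection over union). As a reminder of what the metric measures, here is a minimal sketch of per-class IoU on small label grids. It illustrates the standard definition and is not the paper's evaluation code; the class ids and the 3x3 grids are made up for the example.

```python
def per_class_iou(pred, target, num_classes):
    """Return the IoU of each class for two equally sized 2D label grids."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Cell counts once in both intersection and union.
                intersection[p] += 1
                union[p] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    # A class absent from both grids has an undefined IoU.
    return [i / u if u else float("nan") for i, u in zip(intersection, union)]


# Hypothetical class ids: 0 = background, 1 = vehicle, 2 = lane.
pred = [
    [0, 1, 1],
    [0, 1, 0],
    [2, 2, 0],
]
target = [
    [0, 1, 0],
    [0, 1, 1],
    [2, 0, 0],
]
print(per_class_iou(pred, target, num_classes=3))
# prints: [0.5, 0.5, 0.5]
```

Multiplying each value by 100 gives the percentage scale used in the table.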

This review was generated by AI and checked by a human editor.