QUICK REVIEW

[Paper Review] MagicDrive: Street View Generation with Diverse 3D Geometry Control

Ruiyuan Gao, Kai Chen|arXiv (Cornell University)|Oct 4, 2023

Spatial Cognition and Navigation|Cited by 10

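One-Sentence Summary

MagicDrive generates multi-view street-view images conditioned on BEV and 3D annotations. It encodes road maps and objects separately and adds cross-view attention to keep the camera views consistent, and the resulting synthetic data improves BEV segmentation and 3D object detection.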

ABSTRACT

Recent advancements in diffusion models have significantly enhanced the data synthesis with 2D control. Yet, precise 3D control in street view generation, crucial for 3D perception tasks, remains elusive. Specifically, utilizing Bird's-Eye View (BEV) as the primary condition often leads to challenges in geometry control (e.g., height), affecting the representation of object shapes, occlusion patterns, and road surface elevations, all of which are essential to perception data synthesis, especially for 3D object detection tasks. In this paper, we introduce MagicDrive, a novel street view generation framework, offering diverse 3D geometry controls including camera poses, road maps, and 3D bounding boxes, together with textual descriptions, achieved through tailored encoding strategies. Besides, our design incorporates a cross-view attention module, ensuring consistency across multiple camera views. With MagicDrive, we achieve high-fidelity street-view image & video synthesis that captures nuanced 3D geometry and various scene descriptions, enhancing tasks like BEV segmentation and 3D object detection.

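Motivation and Objectives

Motivation: collecting data for autonomous driving is costly, which creates a need for street-view synthesis that is both realistic and controllable.
Develop a diffusion-based framework that generates multi-view street-view images from 3D geometry (BEV maps, 3D boxes) and scene text.
Improve multi-view consistency through a cross-view attention module and separate encoding of roads and objects.
Enable flexible control over scene attributes (weather, time of day, etc.) and demonstrate the benefit of the generated data as augmentation for downstream perception tasks.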

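Proposed Method

Build on a latent diffusion model (Stable Diffusion) and condition it on the scene and the camera pose.
Encode scene-level information (text and camera pose) with the CLIP text encoder and a Fourier embedding of the pose, and inject it through cross-attention.
Encode 3D bounding boxes through a separate cross-attention path, using class embeddings together with Fourier-embedded box coordinates.
Encode the road map with an additional encoder branch that injects the grid-shaped BEV information.
Introduce a cross-view attention module that propagates information between adjacent camera views to keep them consistent.
Train with classifier-free guidance (CFG) and add invisible boxes as data augmentation to improve generalization to geometric transformations.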
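
The Fourier-embedded box coordinates mentioned in the list above can be illustrated with a small sketch. It rests on assumptions that this review does not state: the frequency schedule (powers of two times pi), the 8-corner box representation, and the plain concatenation with a class embedding are hypothetical choices, and `fourier_embed` and `embed_box` are illustrative names, not functions from the MagicDrive code.

```python
import math


def fourier_embed(values, num_freqs=4):
    """Map each scalar x to [sin(2^k * pi * x), cos(2^k * pi * x)] for k < num_freqs."""
    out = []
    for x in values:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * x
            out.append(math.sin(angle))
            out.append(math.cos(angle))
    return out


def embed_box(corners, class_embedding, num_freqs=4):
    """Build one box token: class embedding followed by Fourier features of the corners.

    corners: 8 (x, y, z) tuples (assumed representation).
    class_embedding: any fixed-length list of floats (placeholder for a learned embedding).
    """
    flat = [coord for corner in corners for coord in corner]  # 8 corners x 3 coords
    return list(class_embedding) + fourier_embed(flat, num_freqs)
```

With 8 corners and `num_freqs=4`, each box contributes 24 scalars x 2 functions x 4 frequencies = 192 Fourier features. In a real model these features would typically pass through learned layers before entering cross-attention; the sketch stops at the raw embedding.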

Figure 1: Multi-camera street view generation from MagicDrive. MagicDrive can generate continuous camera views with controls from the road map, object boxes, and text (e.g., weather).
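
The cross-view attention idea, where each camera view pulls in information from its adjacent views, can be sketched roughly as follows. This is a simplified illustration under assumptions: cameras are treated as a ring so every view has a left and a right neighbour, learned query/key/value projections are omitted, and the neighbour contributions are simply added to the view's own features. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def cross_view_attention(views):
    """views[i] is the list of token vectors for camera i (cameras assumed to form a ring)."""
    n = len(views)
    updated = []
    for i, tokens in enumerate(views):
        left, right = views[(i - 1) % n], views[(i + 1) % n]
        from_left = attend(tokens, left, left)
        from_right = attend(tokens, right, right)
        # Residual update: keep the view's own features and add the neighbours' contributions.
        updated.append([[t + l + r for t, l, r in zip(tok, lv, rv)]
                        for tok, lv, rv in zip(tokens, from_left, from_right)])
    return updated
```

Restricting attention to the two neighbouring views keeps the cost linear in the number of cameras, whereas attending to all views at once would grow quadratically.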

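Experimental Results

Research Questions

RQ1: How can street-view images be generated from BEV maps and 3D bounding boxes while keeping the camera views consistent with one another?
RQ2: Compared with BEV-only conditioning, does encoding the road map and the 3D boxes separately improve controllability and realism?
RQ3: When used as data augmentation, does diffusion-based street-view synthesis improve downstream BEV segmentation and 3D object detection?
RQ4: How does cross-view attention affect multi-camera consistency and perception-task performance?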

Key Findings

Method	Synthesis resolution	FID ↓	BEV segmentation: Road mIoU ↑	BEV segmentation: Vehicle mIoU ↑	3D object detection: mAP ↑	3D object detection: NDS ↑
Oracle	-	-	72.21	33.66	35.54	41.21
Oracle	224 × 400	-	72.19	33.61	23.54	31.08
BEVGen	224 × 400	25.54	50.20	5.89	-	-
BEVControl	-	24.85	60.80	26.80	-	-
MagicDrive	224 × 400	16.20	61.05	27.01	12.30	23.32
MagicDrive	272 × 736	16.59	54.24	31.05	20.85	30.26
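
The mIoU figures in the table are intersection-over-union scores. As a reminder of what the metric measures, here is a minimal sketch for a single class on a binary BEV mask. The mask layout (2D lists of 0/1) and the convention of returning 0.0 for two empty masks are assumptions for illustration; how the evaluation code aggregates scores is not described in this review.

```python
def iou(pred, target):
    """Intersection over union of two equally sized binary masks (2D lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks score 0.0 instead of raising an error.
    return inter / union if union else 0.0
```

The table values appear to be these ratios expressed as percentages.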

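MagicDrive achieves a lower FID than the baselines, indicating more realistic images: 16.20 at 224 × 400 and 16.59 at 272 × 736, versus 25.54 for BEVGen and 24.85 for BEVControl.
BEV segmentation: in the table, MagicDrive reaches a road mIoU of 61.05 at 224 × 400 and 54.24 at 272 × 736, with a vehicle mIoU of 27.01 and 31.05, respectively.
3D object detection: in the table, MagicDrive reaches mAP 12.30 and NDS 23.32 at 224 × 400, and mAP 20.85 and NDS 30.26 at 272 × 736.
In the evaluations with BEVFusion and CVT, MagicDrive outperforms BEVGen and BEVControl, showing stronger controllability and realism.
Ablations show that, compared with removing the box encoder, the separate E_box encoding and the f_viz augmentation improve the vehicle and road segmentation metrics.
The CFG study identifies the best guidance settings. Turning off certain conditions during CFG can raise road mIoU while affecting the guidance of vehicles, which points to a trade-off in CFG.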
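
The CFG trade-off in the last finding is easier to follow with the guidance formula in hand. The sketch below shows the standard classifier-free guidance combination plus a helper that builds the "unconditional" input by switching off chosen conditions. It is an illustration under assumptions: noise predictions are flat lists of floats, `None` is a placeholder for whatever null condition the model actually uses, and the names `cfg_combine` and `null_out` are hypothetical.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def null_out(conditions, dropped):
    """Return a copy of the condition dict with the named conditions switched off.

    `None` is only a stand-in for the model's real null condition.
    """
    return {name: (None if name in dropped else value)
            for name, value in conditions.items()}
```

A scale of 1.0 reproduces the conditional prediction, and larger scales amplify whatever differs between the two branches. Only the conditions nulled in the unconditional branch are amplified, so the choice of which conditions to switch off (for example boxes, the map, or both) changes which controls the guidance strengthens.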

Figure 2: 3D bounding boxes are crucial for street view synthesis. Two examples show that 2D boxes or BEV maps lost distance, height, and elevation. Images are generated from MagicDrive.

This review was generated by AI and checked by a human editor.