QUICK REVIEW

[Paper Digest] WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories

Yisu Zhang, Chenjie Cao|arXiv (Cornell University)|Mar 2, 2026

Advanced Vision and Imaging | Cited by 0

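One-Sentence Summary

WorldStereo introduces two geometry-aware memory modules, Global-Geometric Memory and Spatial-Stereo Memory, that enable multi-trajectory, camera-guided video generation whose outputs can be reconstructed into consistent 3D scenes, with a distilled backbone keeping inference efficient.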

ABSTRACT

Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.

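Motivation and Goals

Enable robust 3D reconstruction from camera-guided video generation.
Bridge camera-guided video diffusion models and 3D reconstruction through memory mechanisms.
Keep the VDM backbone frozen and add geometry-aware memories on top of it, so that the backbone's generalization is preserved.
Achieve long-trajectory, multi-view consistency suitable for 3D scene reconstruction.
Provide a new benchmark for 3D reconstruction from camera-guided generation.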

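Proposed Method

Extends a camera-guided VDM (Uni3C) with two memory modules: Global-Geometric Memory (GGM) and Spatial-Stereo Memory (SSM).
GGM incrementally updates a 3D cache of global point clouds, providing coarse geometric priors across multiple trajectories.
SSM performs geometry-aware attention: it retrieves reference views and uses 3D correspondences (pointmaps) to guide the synthesis of fine-grained details.
ControlNet branches inject pixel-aligned conditions without retraining the whole diffusion model, which preserves generalization.
Distribution Matching Distillation (DMD) distills the frozen VDM backbone into a faster 4-step DiT generator for efficient inference.
The 3D cache is built with WorldMirror-style reconstruction and aligned with the Umeyama transformation for multi-view consistency.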
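The Umeyama step named in the last item can be illustrated with a small example. The sketch below is a generic NumPy version of Umeyama's closed-form similarity alignment (scale, rotation, translation) between two sets of corresponding 3D points. It assumes the correspondences are already known. The function names `umeyama_alignment` and `apply_similarity` are hypothetical and are not taken from the WorldStereo code.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Estimate s, R, t minimizing sum ||dst_i - (s * R @ src_i + t)||^2.

    src, dst: (N, 3) arrays of corresponding points (assumed given).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)
    # Guard against a reflection so that R is a proper rotation.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

def apply_similarity(points, s, R, t):
    """Map (N, 3) points with the estimated similarity transform."""
    return s * np.asarray(points, dtype=float) @ R.T + t
```

In a cache-alignment setting, `src` would hold points from a newly reconstructed chunk and `dst` the matching points already in the global cache; applying the returned transform brings the new chunk into the cache's coordinate frame and scale.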

Figure 2: Overview of WorldStereo. WorldStereo comprises two ControlNet branches. The camera branch ensures precise camera control and Global-Geometric Memory (GGM), depending on global point clouds; the Spatial-Stereo Memory (SSM) branch leverages retrieved reference frames and pointmap (3D corres
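A point cloud becomes a pixel-aligned condition for a target view once it is projected into that view's camera. The sketch below shows one simple way to do this, a pinhole projection with a nearest-depth z-buffer. It is an illustration under stated assumptions, not the renderer used by Uni3C or WorldStereo: the function name `render_point_cloud_depth`, the world-to-camera convention `x_cam = R @ x_world + t`, and the pure-Python loops are all choices made here for clarity.

```python
def render_point_cloud_depth(points_world, R, t, fx, fy, cx, cy, width, height):
    """Project 3D points into a pinhole camera, keeping the nearest depth per pixel.

    R is a 3x3 nested list and t a length-3 list with x_cam = R @ x_world + t
    (an assumed convention). Returns a height x width grid of depths, with
    None where no point lands.
    """
    depth = [[None] * width for _ in range(height)]
    for X in points_world:
        # World-to-camera transform.
        xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
        z = xc[2]
        if z <= 0:
            continue  # point is behind the camera
        # Pinhole projection to pixel coordinates.
        u = int(round(fx * xc[0] / z + cx))
        v = int(round(fy * xc[1] / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # Z-buffer: keep only the closest point for each pixel.
            if depth[v][u] is None or z < depth[v][u]:
                depth[v][u] = z
    return depth
```

Pixels left as `None` are regions the cache does not cover from this viewpoint, which is where a generative model has to synthesize new content.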

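Experimental Results

Research Questions

RQ1: Can geometry-aware memories (GGM and SSM) enable multi-trajectory video generation with consistent geometry?
RQ2: How does memory augmentation affect the camera-control accuracy and 3D reconstruction quality of a camera-guided VDM?
RQ3: Does the method generalize to 3D scene generation from panoramic inputs and from single-view inputs?
RQ4: How does accelerated inference via DMD affect quality and controllability?

Key Findings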

Method	F1-Score	AUC	RotErr	TransErr	ATE
Uni3C	0.424	0.378	0.362	0.1017	0.1572
Gen3C	0.416	0.380	0.342	0.0949	0.1704
SEVA	0.286	0.293	0.379	0.0949	0.1815
Lyra	0.227	0.193	–	–	–
VMem	0.386	0.375	0.533	0.1510	0.1922
WorldStereo*	0.447	0.389	0.377	0.0990	0.1545
WorldStereo-GGM	0.485	0.411	0.224	0.0885	0.1350
WorldStereo-Full	0.578	0.437	0.247	0.0927	0.1501
WorldStereo-DMD	0.534	0.410	0.291	0.1001	0.1547
MipNeRF360 - Uni3C	0.352	0.347	0.112	0.0086	0.0104
MipNeRF360 - Gen3C	0.356	0.340	0.349	0.0220	0.0318
MipNeRF360 - SEVA	0.332	0.311	0.282	0.0138	0.0295
MipNeRF360 - Lyra	0.203	0.263	–	–	–
MipNeRF360 - VMem	0.256	0.245	0.403	0.0392	0.0752
MipNeRF360 - WorldStereo*	0.350	0.342	0.097	0.0076	0.0099
MipNeRF360 - WorldStereo-GGM	0.342	0.346	0.107	0.0079	0.0206
MipNeRF360 - WorldStereo-Full	0.406	0.402	0.114	0.0080	0.0132
MipNeRF360 - WorldStereo-DMD	0.390	0.387	0.159	0.0106	0.0267

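WorldStereo outperforms the baselines in camera-control accuracy and video generation quality; in each block of the table, the lowest RotErr, TransErr, and ATE belong to a WorldStereo variant (WorldStereo-GGM in the first block, WorldStereo* on MipNeRF360).
GGM improves global 3D structural consistency across trajectories, while SSM improves fine-grained details through attention guided by 3D correspondences.
Adding both memory modules markedly improves 3D reconstruction metrics on the Tanks&Temples and MipNeRF360 datasets: WorldStereo-Full has the highest F1-Score and AUC in both blocks of the table (0.578 / 0.437 and 0.406 / 0.402).
WorldStereo-DMD greatly speeds up inference (4-step distillation) while retaining strong 3D consistency (F1-Score 0.534 and 0.390, versus 0.578 and 0.406 for WorldStereo-Full).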
A new single-view 3D reconstruction benchmark shows WorldStereo's effectiveness on object-centric, forward-facing, and 360° panoramic tasks.
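The digest does not say how the F1-Score column in the table is computed. For orientation, the sketch below implements the definition commonly used for point-cloud reconstruction: precision is the fraction of predicted points within a distance threshold of the ground truth, recall is the reverse, and F1 is their harmonic mean. Whether the paper uses exactly this protocol is an assumption, and both the function name `point_cloud_f1` and the threshold `tau` are hypothetical. The nearest-neighbor search is brute force and only suitable for small inputs.

```python
import math

def _fraction_within(points_a, points_b, tau):
    """Fraction of points in points_a whose nearest neighbor in points_b is within tau."""
    if not points_a or not points_b:
        return 0.0
    hits = 0
    for p in points_a:
        nearest = min(math.dist(p, q) for q in points_b)
        if nearest <= tau:
            hits += 1
    return hits / len(points_a)

def point_cloud_f1(pred, gt, tau):
    """Return (f1, precision, recall) for predicted vs. ground-truth points."""
    precision = _fraction_within(pred, gt, tau)  # accuracy of the prediction
    recall = _fraction_within(gt, pred, tau)     # completeness of the prediction
    if precision + recall == 0:
        return 0.0, precision, recall
    f1 = 2 * precision * recall / (precision + recall)
    return f1, precision, recall
```

Under this definition a reconstruction scores well only when it is both accurate (few stray points) and complete (few missing regions), which is why inconsistent content across trajectories lowers the score.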

Figure 3 : Spatial-Stereo Memory (SSM). Reference views are retrieved from the memory bank, while pointmaps for both target and reference views are constructed based on the 3D cache. In SSM attention, we horizontally stitch each target-reference pair and rearrange the tensor shape to make each targe

This digest was generated by AI and reviewed by a human editor.