QUICK REVIEW

[Paper Review] WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories

Yisu Zhang, Chenjie Cao|arXiv (Cornell University)|Mar 2, 2026

Advanced Vision and Imaging|Citations: 0

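One-Line Summary

WorldStereo introduces two geometry-aware memory modules (Global-Geometric Memory and Spatial-Stereo Memory) to make camera-guided video generation consistent across multiple camera trajectories. A distilled backbone keeps inference efficient while the generated views remain consistent enough for 3D reconstruction.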

ABSTRACT

Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.

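Motivation and Objectives

Enable robust 3D reconstruction from camera-guided video generation.
Bridge camera-guided diffusion models and 3D reconstruction through memory mechanisms.
Keep the VDM backbone frozen and add geometry-aware memory on top of it, preserving generalization.
Make long-trajectory, multi-view consistency good enough for 3D scene reconstruction.
Provide a new benchmark for evaluating 3D reconstruction from camera-guided generation.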

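Proposed Method

Extends a camera-guided VDM (Uni3C) with two memory modules: Global-Geometric Memory (GGM) and Spatial-Stereo Memory (SSM).
GGM incrementally updates a global point-cloud 3D cache to provide coarse geometric priors across multiple trajectories.
SSM retrieves reference views and applies 3D geometry-aware attention, driven by 3D-corresponding pointmaps, to guide the synthesis of fine details.
Uses ControlNet branches to inject pixel-aligned conditions without retraining the backbone, preserving generalization.
Uses Distribution Matching Distillation (DMD) to distill a 4-step DiT generator from the frozen VDM backbone, making inference efficient.
Builds the 3D cache with WorldMirror-style reconstruction and aligns multiple views with the Umeyama transform.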
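The Umeyama alignment mentioned in the last item can be sketched as follows. This is a minimal NumPy version of the standard closed-form similarity fit (scale, rotation, translation) between two corresponding point sets; it is not taken from the paper's code. The function name `umeyama_alignment` and its arguments are hypothetical, and how WorldStereo selects the correspondences is not described in this review.

```python
import numpy as np


def umeyama_alignment(src, dst, with_scale=True):
    """Fit s, R, t minimizing sum_i ||dst_i - (s * R @ src_i + t)||^2.

    src, dst: (N, D) arrays of corresponding points (hypothetical inputs,
    e.g. points from a new reconstruction and from the existing 3D cache).
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    n, dim = src.shape

    # Center both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between target and source, then its SVD.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(dim)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = float(np.trace(np.diag(D) @ S) / var_src) if with_scale else 1.0
    t = mu_dst - s * (R @ mu_src)
    return s, R, t
```

Applying the result as `s * points @ R.T + t` maps points from the source frame into the target frame, which is the kind of step needed before merging a newly reconstructed point cloud into a shared cache.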

Figure 2 : Overview of WorldStereo. WorldStereo comprises two ControlNet branches. The camera branch ensures precise camera control and Global-Geometric Memory (GGM), depending on global point clouds; the Spatial-Stereo Memory (SSM) branch leverages retrieved reference frames and pointmap (3D corres

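Experimental Results

Research Questions

RQ1 Do the geometry-aware memories (GGM and SSM) enable video generation with consistent 3D geometry across multiple camera trajectories?
RQ2 How do the memory extensions affect camera-control accuracy and 3D reconstruction quality?
RQ3 Does the method generalize to panorama-based and single-view inputs?
RQ4 How does DMD-based inference acceleration affect quality and controllability?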

Key Findings

Method	F1-Score	AUC	RotErr	TransErr	ATE
Uni3C	0.424	0.378	0.362	0.1017	0.1572
Gen3C	0.416	0.380	0.342	0.0949	0.1704
SEVA	0.286	0.293	0.379	0.0949	0.1815
Lyra	0.227	0.193	–	–	–
VMem	0.386	0.375	0.533	0.1510	0.1922
WorldStereo*	0.447	0.389	0.377	0.0990	0.1545
WorldStereo-GGM	0.485	0.411	0.224	0.0885	0.1350
WorldStereo-Full	0.578	0.437	0.247	0.0927	0.1501
WorldStereo-DMD	0.534	0.410	0.291	0.1001	0.1547
MipNeRF360 - Uni3C	0.352	0.347	0.112	0.0086	0.0104
MipNeRF360 - Gen3C	0.356	0.340	0.349	0.0220	0.0318
MipNeRF360 - SEVA	0.332	0.311	0.282	0.0138	0.0295
MipNeRF360 - Lyra	0.203	0.263	–	–	–
MipNeRF360 - VMem	0.256	0.245	0.403	0.0392	0.0752
MipNeRF360 - WorldStereo*	0.350	0.342	0.097	0.0076	0.0099
MipNeRF360 - WorldStereo-GGM	0.342	0.346	0.107	0.0079	0.0206
MipNeRF360 - WorldStereo-Full	0.406	0.402	0.114	0.0080	0.0132
MipNeRF360 - WorldStereo-DMD	0.390	0.387	0.159	0.0106	0.0267
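
This review does not define the F1-Score column. A common convention for point-cloud reconstruction is the harmonic mean of precision and recall at a distance threshold, and the sketch below assumes that convention. The threshold value, the brute-force nearest-neighbour search, and the function names are illustrative assumptions, not the paper's evaluation protocol.

```python
import numpy as np


def nearest_distances(a, b):
    """For each point in a, the Euclidean distance to its nearest point in b."""
    diff = a[:, None, :] - b[None, :, :]           # (len(a), len(b), 3)
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def point_cloud_f1(pred, gt, threshold):
    """F1 of a predicted point cloud against ground truth at a distance threshold.

    precision: fraction of predicted points within `threshold` of the ground truth
    recall:    fraction of ground-truth points within `threshold` of the prediction
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    precision = float((nearest_distances(pred, gt) < threshold).mean())
    recall = float((nearest_distances(gt, pred) < threshold).mean())
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

The brute-force distance matrix is only practical for small clouds; a spatial index such as a k-d tree would be the usual replacement at benchmark scale.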

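WorldStereo outperforms the baselines in camera-control accuracy and in the quality of the generated videos.
GGM improves the consistency of global 3D structure across trajectories, while SSM improves fine detail through attention based on 3D correspondence.
Combining both memory modules (WorldStereo-Full) gives the best F1-Score and AUC in both blocks of the table, covering the Tanks&Temples and MipNeRF360 datasets.
WorldStereo-DMD, distilled to 4 steps, speeds up inference substantially while maintaining strong 3D consistency.
The new single-view 3D reconstruction benchmark shows that WorldStereo is effective on object-centric, forward-facing, and 360° panorama tasks.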
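
To put rough numbers on these findings, the relative F1-Score changes can be computed directly from the table. The snippet below is only arithmetic on values copied from that table. The key `first_block` is a placeholder for the rows that carry no dataset prefix in this review, and the helper names are hypothetical.

```python
# F1-Score values copied from the table above.
# "first_block" marks the rows listed without a dataset prefix.
F1_SCORES = {
    "first_block": {
        "Uni3C": 0.424,
        "WorldStereo-Full": 0.578,
        "WorldStereo-DMD": 0.534,
    },
    "MipNeRF360": {
        "Uni3C": 0.352,
        "WorldStereo-Full": 0.406,
        "WorldStereo-DMD": 0.390,
    },
}


def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


def summarize(block):
    """Relative F1 change of Full vs. Uni3C and of DMD vs. Full for one block."""
    rows = F1_SCORES[block]
    return {
        "full_vs_uni3c_pct": relative_change(rows["WorldStereo-Full"], rows["Uni3C"]),
        "dmd_vs_full_pct": relative_change(rows["WorldStereo-DMD"], rows["WorldStereo-Full"]),
    }


if __name__ == "__main__":
    for block in F1_SCORES:
        print(block, {k: round(v, 1) for k, v in summarize(block).items()})
```

With these values, WorldStereo-Full's F1-Score is roughly 36% above Uni3C in the first block and roughly 15% above it on MipNeRF360, while the distilled variant gives up about 8% and 4% of the full model's F1-Score, respectively.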

Figure 3 : Spatial-Stereo Memory (SSM). Reference views are retrieved from the memory bank, while pointmaps for both target and reference views are constructed based on the 3D cache. In SSM attention, we horizontally stitch each target-reference pair and rearrange the tensor shape to make each targe

This review was generated by AI and checked by a human editor.