QUICK REVIEW

[Paper Review] MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Wei Yu, Runjia Qian|arXiv (Cornell University)|Mar 17, 2026

Advanced Vision and Imaging | Cited by 0

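One-Sentence Summary

MosaicMem introduces a hybrid spatial memory that lifts image patches into 3D for precise localization while conditioning the model through implicit retrieval, enabling controllable memory-driven scene editing and long-horizon video generation.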

ABSTRACT

Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.

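Research Motivation and Goals

Advance persistent, controllable video world models that stay consistent under camera motion and revisits.
Examine the limitations of purely explicit or purely implicit memory in dynamic scenes.
Propose MosaicMem, a patch-based hybrid memory that combines explicit 3D lifting with implicit conditioning.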

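Proposed Method

Introduce Mosaic Memory, a patch-based memory unit that is lifted into 3D for localization and then used as a reference signal through implicit conditioning.
Align memory patches with the current view via geometric projection, using Warped RoPE and Warped Latent.
Adopt PRoPE as the camera-conditioning interface to improve viewpoint controllability during generation.
Provide a memory-aligned generation pipeline within a TI2V (text + image to video) framework, using a neural vector field with a probability-flow ODE.
Build the MosaicMem-World dataset, which emphasizes revisits and long-range memory retrieval, for evaluation.
Enable memory manipulation and autoregressive long-range generation (Mosaic Forcing) with real-time performance.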
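
The lifting and geometric alignment steps can be illustrated with a minimal sketch. It assumes a pinhole camera, a known depth per patch center, and 4x4 camera pose matrices; none of these details are given in this review, and the function names `lift_patch_centers` and `project_to_view` are hypothetical. It is not the paper's implementation of Warped RoPE or Warped Latent, only the generic unproject-and-reproject geometry that such alignment would build on.

```python
import numpy as np

def lift_patch_centers(uv, depth, K, cam_to_world):
    """Unproject patch-center pixels (N, 2) with depths (N,) into world space.

    Assumption: pinhole intrinsics K (3x3) and a 4x4 camera-to-world pose.
    """
    ones = np.ones((uv.shape[0], 1))
    pixels_h = np.hstack([uv, ones])          # (N, 3) homogeneous pixels
    rays = pixels_h @ np.linalg.inv(K).T      # camera-frame rays with z = 1
    pts_cam = rays * depth[:, None]           # scale each ray by its depth
    pts_cam_h = np.hstack([pts_cam, ones])    # (N, 4) homogeneous points
    return (pts_cam_h @ cam_to_world.T)[:, :3]

def project_to_view(pts_world, K, world_to_cam, width, height):
    """Project world points into a query camera.

    Returns pixel coordinates (N, 2) and a boolean mask that is True for
    points in front of the camera and inside the image bounds.
    """
    ones = np.ones((pts_world.shape[0], 1))
    pts_cam = (np.hstack([pts_world, ones]) @ world_to_cam.T)[:, :3]
    z = pts_cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)       # avoid dividing by zero or negative depth
    uv = (pts_cam @ K.T)[:, :2] / safe_z[:, None]
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return uv, in_front & inside
```

In a retrieval setting, the visibility mask would select which stored patches are candidates for the queried view, and the projected coordinates would indicate where each one should land. How MosaicMem turns those coordinates into positional encodings or warped latents is not described here.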

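Experimental Results

Research Questions

RQ1: Does MosaicMem outperform existing explicit or implicit memory baselines in camera-motion accuracy and long-range consistency?
RQ2: Does patch-based Mosaic Memory handle moving objects robustly while preserving prompt following?
RQ3: How much do warping-based alignment methods such as Warped RoPE and Warped Latent contribute to memory alignment and visual fidelity?
RQ4: How does PRoPE camera conditioning affect viewpoint controllability and memory-guided generation?
RQ5: How manipulable is the memory for scene editing and extended autoregressive generation?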

Key Findings

Method	RotErr (°) ↓	TransErr ↓	FID ↓	FVD ↓	SSIM ↑	PSNR ↑	LPIPS ↓	Dynamic ↑
MosaicMem (full)	0.51	0.06	65.67	232.95	0.75	23.57	0.11	2.58
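
This review does not define the metrics in the table. As a rough guide to reading the pose and reconstruction columns, the sketch below uses common definitions: geodesic angle between rotation matrices for RotErr, Euclidean distance between translation vectors for TransErr, and the standard PSNR formula. These are assumptions; the paper may normalize or aggregate differently, and the function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    R_rel = R_gt.T @ R_pred
    # Clip guards against values slightly outside [-1, 1] from rounding.
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

def translation_error(t_pred, t_gt):
    """Euclidean distance between two translation vectors (assumed definition)."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))

def psnr(img_pred, img_gt, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    diff = np.asarray(img_pred, dtype=np.float64) - np.asarray(img_gt, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Under these definitions, lower RotErr and TransErr mean the generated camera trajectory follows the requested poses more closely, and higher PSNR means revisited views match the reference frames more closely.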

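MosaicMem follows camera motion more accurately than implicit memory baselines and handles dynamic objects more robustly than explicit memory baselines.
The full MosaicMem (using both Warped RoPE and Warped Latent) achieves the best overall performance across camera control, visual quality, and memory retrieval.
The hybrid warping strategy is the most robust for memory conditioning and minimizes artifacts in autoregressive generation.
MosaicMem supports minute-level navigation, persistent memory, and scene editing through patch-level operations.
Mosaic Forcing achieves real-time autoregressive generation (16 FPS) at 640x360 resolution, with higher quality and consistency than RELIC and Matrix-Game 2.0.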

This review was generated by AI and checked by human editors.