QUICK REVIEW

[Paper Review] Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos

Shuo Sun, Unal Artan|arXiv (Cornell University)|Mar 12, 2026

Advanced Vision and Imaging|Cited by 0

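One-Sentence Summary

A two-stage optimization framework for dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras, built on spatiotemporal multi-camera tracking, wide-baseline initialization, and post-tracking depth refinement.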

ABSTRACT

We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras -- a setting that arises naturally when multiple observers capture a shared event. Prior approaches either handle only single-camera input or require rigidly mounted, pre-calibrated camera rigs, limiting their practical applicability. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, enabling consistent scale and robust tracking. To ensure robustness under limited overlap, we introduce a wide-baseline initialization strategy using feed-forward reconstruction models. In the second stage, we refine depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow. Additionally, we introduce MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Finally, we demonstrate that our method significantly outperforms state-of-the-art feed-forward models on both synthetic and real-world benchmarks, while requiring less memory.

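Research Motivation and Goals

Achieve robust dense dynamic scene reconstruction from multiple freely moving cameras without requiring rigid extrinsics.
Achieve consistent scale and accurate camera pose estimation across overlapping and non-overlapping views.
Develop a two-stage pipeline that separates initial tracking from dense depth refinement to improve robustness and efficiency.
Provide a multi-camera dataset with real-world ground-truth poses for evaluating dynamic multi-view reconstruction methods.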

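Proposed Method

Extends single-camera SLAM to the multi-camera setting through a spatiotemporal connection graph that links intra-camera temporal connections with inter-camera spatiotemporal overlap for joint optimization.
Uses a wide-baseline initialization strategy based on feed-forward reconstruction models to provide a global scale anchor and initial poses.
Refines depth and camera poses by optimizing dense inter-camera and intra-camera consistency using wide-baseline optical flow.
Introduces a two-stage depth refinement that combines dense correspondences with per-frame scale/shift parameters to align monocular depth predictions across cameras.
Applies pose regularization and temporal smoothness during optimization to stabilize online refinement.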
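
The spatiotemporal connection graph described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_connection_graph`, the `overlap` callback, and the `temporal_window` and `min_overlap` parameters are all hypothetical, and the sketch assumes that inter-camera edges are only attempted between frames sharing a timestamp.

```python
from itertools import combinations


def build_connection_graph(num_cams, num_frames, overlap,
                           temporal_window=2, min_overlap=0.3):
    """Return a set of edges between (camera, frame) nodes.

    `overlap(cam_a, cam_b, t)` is a hypothetical callback that returns an
    overlap score in [0, 1] for two cameras at timestamp t. The thresholds
    are illustrative assumptions, not values from the paper.
    """
    edges = set()

    # Intra-camera temporal edges: link each frame to its next few frames.
    for cam in range(num_cams):
        for t in range(num_frames):
            for dt in range(1, temporal_window + 1):
                if t + dt < num_frames:
                    edges.add(((cam, t), (cam, t + dt)))

    # Inter-camera spatial edges: link two cameras at the same timestamp
    # only when their views overlap enough.
    for t in range(num_frames):
        for cam_a, cam_b in combinations(range(num_cams), 2):
            if overlap(cam_a, cam_b, t) >= min_overlap:
                edges.add(((cam_a, t), (cam_b, t)))

    return edges


# Two cameras, three frames, overlap only at the first timestamp.
edges = build_connection_graph(
    2, 3, lambda a, b, t: 0.5 if t == 0 else 0.0, temporal_window=1
)
print(sorted(edges))
```

In a tracking system, each edge would contribute a consistency term to a joint optimization over poses and depth. The sketch covers only how the graph itself might be assembled.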

Figure 2: Method Overview. Given multiple video inputs, our method first uses a feed-forward model for initialization to achieve a global scale anchor and initialized poses (Step 1). Then, we build a spatio-temporal connection graph during tracking to estimate camera poses and maintain a consistent

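Experimental Results

Research Questions

RQ1: Can a setup of multiple freely moving cameras achieve robust, metrically consistent dense scene reconstruction without pre-calibration?
RQ2: How do spatiotemporal connections between cameras improve tracking robustness and scale consistency in dynamic scenes?
RQ3: Does the two-stage approach (initial tracking + dense depth refinement) deliver better reconstruction quality than fully feed-forward models at lower memory cost?
RQ4: How does wide-baseline initialization affect robustness when field-of-view overlap is limited?
RQ5: How do multi-view depth refinement and optical-flow-based constraints perform on real-world multi-camera datasets?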

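Key Findings

The proposed method achieves better tracking and reconstruction results than existing feed-forward models on both synthetic and real-world benchmarks.
It requires less memory than competing feed-forward methods while delivering improved pose and depth accuracy.
The spatiotemporal connection graph effectively exploits intra-camera temporal continuity and inter-camera spatial overlap to maintain consistent scale.
Wide-baseline initialization via VGGT, combined with monocular depth alignment, provides robust global scale anchoring in scenes with challenging overlap.
Two-stage depth refinement, combining dense optical flow with per-frame scale/shift optimization, reduces depth flicker and improves multi-view consistency.
The method performs well on the new MultiCamRobolab real-world dataset, which has ground-truth poses from a motion capture system.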
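
The per-frame scale/shift parameters mentioned above can be illustrated with a closed-form least-squares fit. This is only a sketch of the general alignment idea. The paper optimizes these parameters together with dense flow consistency, which is not reproduced here. The function name `align_scale_shift` and the choice to fit directly on depth values (rather than inverse depth) are assumptions made for illustration.

```python
def align_scale_shift(mono, ref):
    """Fit ref ≈ s * mono + t in the least-squares sense.

    `mono` holds monocular depth samples for one frame and `ref` holds the
    matching reference depths (for example, from tracked points). Both are
    equal-length sequences of floats. Returns the pair (s, t).
    """
    n = len(mono)
    if n != len(ref) or n < 2:
        raise ValueError("need at least two paired samples")

    mean_x = sum(mono) / n
    mean_y = sum(ref) / n

    # Variance of the input and covariance with the reference (unnormalized).
    var_x = sum((x - mean_x) ** 2 for x in mono)
    if var_x == 0:
        raise ValueError("monocular depths are constant; scale is undefined")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mono, ref))

    s = cov_xy / var_x          # scale
    t = mean_y - s * mean_x     # shift
    return s, t


# Reference depths generated as 2 * mono + 0.5, so the fit recovers s = 2, t = 0.5.
s, t = align_scale_shift([1.0, 2.0, 3.0, 4.0], [2.5, 4.5, 6.5, 8.5])
aligned = [s * d + t for d in [1.0, 2.0, 3.0, 4.0]]
```

Fitting one (s, t) pair per frame keeps the monocular prediction's relative structure while tying its absolute scale to the shared reconstruction, which is one plausible way such alignment reduces flicker between frames.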

Figure 3: Demonstration of the spatio-temporal graph. First, each camera will estimate temporal connections with its own frames. Second, at the timestamp $t_{0}$, Cam.1 will try to make a spatial connection with Cam.0 if there is enough overlap. Additionally, the current active keyframe will try to make


This review was generated by AI and reviewed by human editors.