QUICK REVIEW

[Paper Review] ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

Kyle Sargent, Zizhang Li|arXiv (Cornell University)|Oct 27, 2023

Advanced Vision and Imaging | Cited by 15

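One-Sentence Summary

ZeroNVS trains a 3D-aware diffusion model on diverse real-world scene data to synthesize 360-degree novel views from a single image, introducing a new camera conditioning scheme and SDS anchoring; it sets a state-of-the-art zero-shot LPIPS on DTU and achieves the best LPIPS among zero-shot baselines on Mip-NeRF 360.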

ABSTRACT

We introduce a 3D-aware diffusion model, ZeroNVS, for single-image novel view synthesis for in-the-wild scenes. While existing methods are designed for single objects with masked backgrounds, we propose new techniques to address challenges introduced by in-the-wild multi-object scenes with complex backgrounds. Specifically, we train a generative prior on a mixture of data sources that capture object-centric, indoor, and outdoor scenes. To address issues from data mixture such as depth-scale ambiguity, we propose a novel camera conditioning parameterization and normalization scheme. Further, we observe that Score Distillation Sampling (SDS) tends to truncate the distribution of complex backgrounds during distillation of 360-degree scenes, and propose "SDS anchoring" to improve the diversity of synthesized novel views. Our model sets a new state-of-the-art result in LPIPS on the DTU dataset in the zero-shot setting, even outperforming methods specifically trained on DTU. We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark for single-image novel view synthesis, and demonstrate strong performance in this setting. Our code and data are at http://kylesargent.github.io/zeronvs/

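Motivation and Objectives

Tackle zero-shot 360-degree novel view synthesis for in-the-wild scenes with multiple objects and complex backgrounds.
Develop a diffusion-based prior trained on a mixture of large real-world scene datasets (CO3D, RealEstate10K, ACID) to handle diverse geometry and backgrounds.
Design a robust camera conditioning scheme and scene normalization that resolve scale ambiguity and improve 3D consistency.
Mitigate the loss of background diversity caused by SDS-based distillation through SDS anchoring, encouraging diverse and plausible backgrounds.
Demonstrate strong zero-shot generalization and establish a new scene-level single-image NVS benchmark (Mip-NeRF 360).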

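Proposed Method

Train a 2D conditional diffusion model, then distill it into 3D with SDS to obtain 3D-consistent novel views from a single image.
Introduce a 6DoF+1 camera conditioning representation (relative pose plus field of view) and a viewer-centric normalization scheme to handle the scale and pose variation of in-the-wild data.
Normalize camera translations by a scene scale parameter q derived from depth statistics to reduce scale ambiguity.
During training, use densified (filled-in) depth maps so that scale is estimated consistently, and unify the conditioning across datasets with a new normalization called "6DoF+1, viewer".
Propose SDS anchoring: sample several anchor views with DDIM, then condition on the nearest anchor view during SDS optimization to increase background diversity.
Evaluate zero-shot on DTU using LPIPS, PSNR, and SSIM, and introduce Mip-NeRF 360 as a new zero-shot scene-level NVS benchmark.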
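
The depth-based normalization listed above can be illustrated with a small sketch. It assumes that q is a low quantile of the valid depth values and that the conditioning vector is a flattened relative rotation, the q-normalized translation, and the field of view. The quantile level, the vector layout, and the names scene_scale and camera_conditioning are all hypothetical choices made for illustration, not the paper's implementation. The example shows why dividing by q helps: the same scene expressed in two different unit systems yields the same conditioning vector.

```python
import math


def scene_scale(depth_values, quantile=0.2):
    """Return a hypothetical scale statistic q: a low quantile of the valid depths.

    The quantile level is an assumption made for this sketch.
    """
    valid = sorted(d for d in depth_values if math.isfinite(d) and d > 0)
    if not valid:
        raise ValueError("depth map has no valid values")
    index = min(len(valid) - 1, int(quantile * len(valid)))
    return valid[index]


def camera_conditioning(rotation, translation, fov_degrees, q):
    """Flatten a relative pose and field of view into one conditioning vector.

    rotation is a 3x3 relative rotation (nested lists), translation a 3-vector.
    The translation is divided by q so the result does not depend on scene units.
    The flat layout (9 + 3 + 1 values) is an assumption of this sketch.
    """
    flat_rotation = [value for row in rotation for value in row]
    scaled_translation = [t / q for t in translation]
    return flat_rotation + scaled_translation + [math.radians(fov_degrees)]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# The same scene expressed in two unit systems that differ by a factor of 3.
depths_a, translation_a = [2.0, 4.0, 6.0, 8.0, 10.0], [1.0, 0.0, 0.0]
depths_b, translation_b = [6.0, 12.0, 18.0, 24.0, 30.0], [3.0, 0.0, 0.0]

cond_a = camera_conditioning(identity, translation_a, 60.0, scene_scale(depths_a))
cond_b = camera_conditioning(identity, translation_b, 60.0, scene_scale(depths_b))
```

Here cond_a and cond_b are equal even though the raw translations differ by a factor of 3, which is the property a scale-normalized conditioning is meant to provide when datasets with inconsistent scene scales are mixed.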

Figure 2: A 3DoF camera pose captures camera elevation, azimuth, and radius for a camera pointed at the origin but is incapable of representing a camera’s roll (pictured) or cameras oriented arbitrarily in space. A model with this parameterization cannot be trained on real-world data, where many of

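Experimental Results

Research Questions

RQ1: How can a diffusion-based prior trained on diverse real-world scenes enable zero-shot 360-degree novel view synthesis from a single image?
RQ2: Which camera conditioning and normalization strategies best handle scale and pose ambiguity in in-the-wild scenes?
RQ3: Can background diversity be improved when SDS-based distillation is used for scene-level NVS, without sacrificing 3D consistency?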

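Key Findings

ZeroNVS achieves state-of-the-art LPIPS on DTU in the zero-shot setting, outperforming even methods trained on DTU.
On Mip-NeRF 360, ZeroNVS achieves the best LPIPS among zero-shot baselines.
Viewer-centric 6DoF+1 conditioning with depth-based scale normalization improves conditioning for the 2D-to-3D pipeline and generalization across diverse datasets (CO3D, ACID, RealEstate10K).
SDS anchoring increases background diversity, and in a user study its results were judged more realistic and more creative.
Ablations show that training on the mixture of CO3D, ACID, and RealEstate10K improves performance on every evaluation dataset.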
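
SDS anchoring, as summarized in this review, conditions each SDS optimization step on the nearest anchor view. A minimal sketch of that selection step follows. It assumes anchors are indexed by azimuth alone and that "nearest" means the smallest circular azimuth difference; the review does not specify the distance measure, so both are assumptions. The function names and the anchor layout are hypothetical, and no diffusion sampling or SDS update is shown.

```python
def angular_distance(a_degrees, b_degrees):
    """Shortest distance between two azimuths on a circle, in degrees."""
    difference = abs(a_degrees - b_degrees) % 360.0
    return min(difference, 360.0 - difference)


def nearest_anchor(query_azimuth, anchor_azimuths):
    """Index of the anchor view closest in azimuth to the queried camera."""
    return min(
        range(len(anchor_azimuths)),
        key=lambda i: angular_distance(query_azimuth, anchor_azimuths[i]),
    )


# Hypothetical setup: the input view at 0 degrees plus three sampled anchors.
anchor_azimuths = [0.0, 90.0, 180.0, 270.0]

# Cameras drawn during optimization are each paired with their closest anchor.
chosen = [nearest_anchor(a, anchor_azimuths) for a in (350.0, 100.0, 200.0)]
```

The wrap-around in angular_distance matters: a camera at 350 degrees is paired with the view at 0 degrees rather than the one at 270 degrees, so cameras behind the scene are conditioned on a nearby generated view instead of the distant input image.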

Figure 3: To a monocular camera, a small object close to the camera (left) and a large object at a distance (right) appear identical, despite representing different scenes. Scale ambiguity in the input view causes ambiguity in novel view synthesis. Specifically, even after conditioning on the image

This review was generated by AI and reviewed by a human editor.