QUICK REVIEW

[Paper Review] Neural 3D Video Synthesis

Tianye Li|arXiv (Cornell University)|Jan 1, 2024

Advanced Vision and Imaging | References: 74 | Cited by: 40

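One-Sentence Summary

This paper proposes a time-conditioned neural radiance field that uses compact latent codes to represent dynamic 3D scenes from multi-view video, enabling high-fidelity, high-resolution novel view synthesis. A hierarchical training scheme and importance sampling based on temporal variation give fast convergence, and the model represents a 10-second, 30 FPS video in 28 MB, outperforming prior work in both quality and efficiency.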

ABSTRACT

We propose a novel approach for 3D video synthesis that is able to represent multi-view video recordings of a dynamic real-world scene in a compact, yet expressive representation that enables high-quality view synthesis and motion interpolation. Our approach takes the high quality and compactness of static neural radiance fields in a new direction: to a model-free, dynamic setting. At the core of our approach is a novel time-conditioned neural radiance field that represents scene dynamics using a set of compact latent codes. To exploit the fact that changes between adjacent frames of a video are typically small and locally consistent, we propose two novel strategies for efficient training of our neural network: 1) An efficient hierarchical training scheme, and 2) an importance sampling strategy that selects the next rays for training based on the temporal variation of the input videos. In combination, these two strategies significantly boost the training speed, lead to fast convergence of the training process, and enable high quality results. Our learned representation is highly compact and able to represent a 10 second 30 FPS multi-view video recording by 18 cameras with a model size of just 28MB. We demonstrate that our method can render high-fidelity wide-angle novel views at over 1K resolution, even for highly complex and dynamic scenes. We perform an extensive qualitative and quantitative evaluation that shows that our approach outperforms the current state of the art. We include additional video and information at: this https URL

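Motivation and Objectives

Synthesize high-quality, view-consistent dynamic 3D scenes from multi-view video recordings.
Capture complex scene dynamics in a compact neural representation, without relying on explicit 3D geometry or motion models.
Speed up training and convergence of dynamic neural radiance fields by exploiting temporal coherence and adaptive sampling.
Render high-fidelity novel views at 1K resolution in complex, fast-moving scenes.
Demonstrate state-of-the-art performance in both qualitative and quantitative evaluation.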

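Proposed Method

The method introduces a time-conditioned neural radiance field that represents scene dynamics through compact latent codes.
A hierarchical training scheme optimizes the network efficiently across the spatial and temporal dimensions.
An importance sampling strategy selects training rays according to the temporal variation in the input videos, concentrating on regions that change strongly over time.
The model is trained end to end on multi-view video from 18 cameras and learns to predict radiance and volume density across space and time.
The representation is highly compact: a 10-second, 30 FPS video is stored in a 28 MB model.
The framework supports inference at high resolution (1K) and maintains high fidelity even in complex, dynamic scenes.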
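
The idea of conditioning a radiance field on a per-frame latent code can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `TimeConditionedField`, the layer sizes, the latent dimension, and the choice to feed position, direction, and latent code into a single two-layer MLP are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np


def positional_encoding(x, num_freqs):
    """Encode each coordinate as sin(2^k x) and cos(2^k x) for k = 0..num_freqs-1."""
    freqs = 2.0 ** np.arange(num_freqs)                  # (F,)
    scaled = x[..., None] * freqs                        # (..., D, F)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)                # (..., D * 2F)


class TimeConditionedField:
    """Hypothetical toy field: (position, direction, frame latent) -> (rgb, density)."""

    def __init__(self, num_frames, latent_dim=32, hidden=64, num_freqs=4, seed=0):
        rng = np.random.default_rng(seed)
        self.num_freqs = num_freqs
        # One learnable latent code per video frame (random here, untrained).
        self.latents = rng.normal(0.0, 0.01, size=(num_frames, latent_dim))
        # Encoded position (3 * 2F) + encoded direction (3 * 2F) + latent code.
        in_dim = 2 * 3 * 2 * num_freqs + latent_dim
        self.w1 = rng.normal(0.0, 0.1, size=(in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, size=(hidden, 4))
        self.b2 = np.zeros(4)

    def query(self, positions, directions, frame_idx):
        """positions, directions: (N, 3) arrays; frame_idx: integer frame index."""
        n = positions.shape[0]
        z = np.broadcast_to(self.latents[frame_idx], (n, self.latents.shape[1]))
        feats = np.concatenate(
            [
                positional_encoding(positions, self.num_freqs),
                positional_encoding(directions, self.num_freqs),
                z,
            ],
            axis=-1,
        )
        h = np.maximum(feats @ self.w1 + self.b1, 0.0)   # ReLU hidden layer
        out = h @ self.w2 + self.b2
        rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))          # sigmoid keeps colors in [0, 1]
        sigma = np.maximum(out[:, 3], 0.0)               # non-negative volume density
        return rgb, sigma


# Usage: 10 s at 30 FPS gives 300 frames, hence 300 latent codes.
field = TimeConditionedField(num_frames=10 * 30)
pts = np.zeros((5, 3))
dirs = np.tile([0.0, 0.0, 1.0], (5, 1))
rgb, sigma = field.query(pts, dirs, frame_idx=0)
print(rgb.shape, sigma.shape)  # (5, 3) (5,)
```

In a design like this the network weights are shared by all frames and only the small latent vector changes with time, which is one way a whole video can fit in a compact model.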

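Experimental Results

Research Questions

RQ1: Can a model-free neural representation effectively capture and synthesize dynamic 3D scenes from multi-view video?
RQ2: How can the training efficiency and convergence speed of dynamic neural radiance fields be improved?
RQ3: To what extent can compact latent codes represent complex scene dynamics without explicit motion modeling?
RQ4: Can the method generalize to highly dynamic scenes while maintaining high-fidelity novel view synthesis at 1K resolution?
RQ5: How does the method compare with state-of-the-art methods on qualitative and quantitative metrics?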

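Key Findings

The method needs a model of only 28 MB for a 10-second, 30 FPS multi-view video sequence captured by 18 cameras.
Thanks to the hierarchical training scheme and importance sampling based on temporal variation, training converges significantly faster than with prior methods.
The method renders high-fidelity novel views at over 1K resolution, even in complex, fast-changing scenes.
Extensive evaluation shows that the method outperforms the current state of the art on both qualitative and quantitative metrics.
The method generalizes well to wide-angle views and complex dynamics without explicit motion modeling.
The importance sampling strategy significantly improves training efficiency by concentrating on regions with highly dynamic content.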
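
Importance sampling of training rays by temporal variation can be sketched as follows. The summary does not say how temporal variation is measured, so this sketch assumes one plausible choice: the absolute deviation of each pixel from its per-pixel temporal median. The function names and the `floor` parameter are hypothetical.

```python
import numpy as np


def temporal_variation_weights(video, floor=1e-3):
    """Per-frame sampling weights from one camera's video.

    video: (T, H, W, 3) array with values in [0, 1].
    Returns a (T, H, W) array whose entries sum to 1 within each frame.
    Assumption: variation is the deviation from the per-pixel temporal median.
    """
    video = np.asarray(video, dtype=np.float64)
    median = np.median(video, axis=0)                    # (H, W, 3) static reference
    diff = np.abs(video - median).mean(axis=-1)          # (T, H, W)
    weights = diff + floor                               # floor keeps static pixels reachable
    return weights / weights.sum(axis=(1, 2), keepdims=True)


def sample_pixels(frame_weights, num_rays, rng):
    """Draw pixel coordinates (row, col) with probability proportional to the weights."""
    h, w = frame_weights.shape
    flat = rng.choice(h * w, size=num_rays, replace=True, p=frame_weights.ravel())
    return np.stack(np.unravel_index(flat, (h, w)), axis=-1)  # (num_rays, 2)


# Usage: a static black video in which one pixel lights up in frame 2.
video = np.zeros((4, 8, 8, 3))
video[2, 2, 3] = 1.0
weights = temporal_variation_weights(video)
rng = np.random.default_rng(0)
pixels = sample_pixels(weights[2], num_rays=1000, rng=rng)
hit_rate = np.mean((pixels[:, 0] == 2) & (pixels[:, 1] == 3))
print(f"share of rays on the changing pixel: {hit_rate:.2f}")
```

Most rays drawn for frame 2 land on the single changing pixel, while the small `floor` term still gives static pixels a nonzero chance, so the background is not ignored entirely.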


This review was generated by AI and reviewed by a human editor.