QUICK REVIEW

[Paper Review] Decomposing NeRF for Editing via Feature Field Distillation

Sosuke Kobayashi, Eiichi Matsumoto|arXiv (Cornell University)|May 31, 2022

Advanced Vision and Imaging|Cited by 103

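One-Sentence Summary

This paper proposes Distilled Feature Fields (DFFs), which distill a 2D image feature encoder into a 3D feature field attached to a NeRF, enabling zero-shot, query-based semantic decomposition and local editing without retraining the radiance field.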

ABSTRACT

Emerging neural radiance fields (NeRF) are a promising scene representation for computer graphics, enabling high-quality 3D reconstruction and novel view synthesis from image observations. However, editing a scene represented by a NeRF is challenging, as the underlying connectionist representations such as MLPs or voxel grids are not object-centric or compositional. In particular, it has been difficult to selectively edit specific regions or objects. In this work, we tackle the problem of semantic scene decomposition of NeRFs to enable query-based local editing of the represented 3D scenes. We propose to distill the knowledge of off-the-shelf, self-supervised 2D image feature extractors such as CLIP-LSeg or DINO into a 3D feature field optimized in parallel to the radiance field. Given a user-specified query of various modalities such as text, an image patch, or a point-and-click selection, 3D feature fields semantically decompose 3D space without the need for re-training and enable us to semantically select and edit regions in the radiance field. Our experiments validate that the distilled feature fields (DFFs) can transfer recent progress in 2D vision and language foundation models to 3D scene representations, enabling convincing 3D segmentation and selective editing of emerging neural graphics representations.

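Motivation and Objectives

Achieve local, object-level semantic decomposition of a NeRF without retraining.
Use off-the-shelf 2D feature encoders (such as CLIP-LSeg and DINO) as teachers to distill a 3D feature field.
Support query-based editing driven by text, image patches, or other modalities.
Demonstrate improved 3D segmentation and multi-view-consistent editing on real-world NeRF scenes.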

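Proposed Method

Extend NeRF with a 3D feature field f(x) alongside the density sigma(x) and color c(x, d).
Train f by distillation: features volume-rendered along each ray are matched to the features of a pretrained image-encoder teacher f_img(I, r) (feature loss L_f), combined with the standard photometric loss (L_p).
Compute the 3D segmentation probability p(l|x) from the dot product of f(x) with query features f_q(l) drawn from a zero-shot label space (Eq. 5).
Use query-based decomposition to selectively blend or edit regions across multiple NeRFs without retraining, with blend weights derived from the segmentation p(l|x).
Demonstrate interaction modalities for region selection, including text, image patches, pixel queries, and clustering, with optional integration with CLIPNeRF for stronger edits.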
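
The steps above can be sketched for a single ray in plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the L1 norm in the feature loss is an assumed choice, and turning the dot products into probabilities with a softmax over labels is likewise assumed here as the usual way to read p(l|x).

```python
import math

def render_weights(sigmas, deltas):
    """Standard NeRF quadrature weights: w_k = T_k * (1 - exp(-sigma_k * delta_k))."""
    weights, transmittance = [], 1.0
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

def render_feature(sigmas, deltas, point_features):
    """Volume-render the per-sample features f(x_k) into one feature per ray."""
    weights = render_weights(sigmas, deltas)
    dim = len(point_features[0])
    return [sum(w * f[i] for w, f in zip(weights, point_features))
            for i in range(dim)]

def feature_loss(rendered, teacher):
    """Illustrative L_f for one ray: L1 distance to the teacher feature (norm is assumed)."""
    return sum(abs(a - b) for a, b in zip(rendered, teacher))

def segmentation_probs(feature, query_features):
    """p(l|x) as a softmax over dot products of f(x) with each query feature f_q(l)."""
    scores = [sum(a * b for a, b in zip(feature, q)) for q in query_features]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy ray: empty space first, then two dense samples carrying feature [0, 1].
rendered = render_feature([0.0, 5.0, 5.0], [0.1, 0.1, 0.1],
                          [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
probs = segmentation_probs(rendered, [[1.0, 0.0], [0.0, 1.0]])
```

In training, the per-ray feature loss would be added to the photometric loss with some weighting; that weight is a hyperparameter not given in this review.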

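Experimental Results

Research Questions

RQ1: Can a 3D feature field distilled from pretrained 2D vision models provide open-set, zero-shot semantic decomposition of NeRF scenes?
RQ2: Can specific regions of a NeRF be edited through query-based segmentation without retraining the radiance field?
RQ3: How does distilling 2D features into a 3D field affect novel view synthesis quality and segmentation performance?
RQ4: How do coarse versus fine sampling and positional encoding (PE) affect the quality and smoothness of 3D decomposition and editing?

Key Findings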

Table 1: 3D semantic segmentation on Replica (mIoU and accuracy)
Metric	Supervised 3DCNN	DFF (Coarse)	DFF (Fine)
mIoU	0.475	0.589	0.583
Accuracy	0.758	0.855	0.855

Table 2: Novel view synthesis and geometry on Replica
Metric	Value	Value	Value	Value	Value
PSNR	–	32.87	32.85	–	–
SSIM	–	0.934	0.932	–	–
LPIPS	–	0.148	0.150	–	–
delta<1.25	–	0.993	0.993	–	–
absrel	–	0.018	0.017	–	–
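
For readers unfamiliar with the segmentation metrics reported above, the following is a minimal sketch of how mIoU and accuracy are commonly computed from per-point labels. The function name is hypothetical, and skipping classes that appear in neither prediction nor ground truth is an assumed convention; the paper's exact evaluation protocol may differ.

```python
def mean_iou_and_accuracy(predicted, target, num_classes):
    """Compute mean intersection-over-union and overall accuracy for integer labels."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    correct = 0
    for p, t in zip(predicted, target):
        if p == t:
            # A correct point counts once in both intersection and union.
            intersection[p] += 1
            union[p] += 1
            correct += 1
        else:
            # A wrong point enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Assumed convention: classes absent from both prediction and target are skipped.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    miou = sum(ious) / len(ious) if ious else 0.0
    accuracy = correct / len(target) if target else 0.0
    return miou, accuracy

# Toy example with two classes and four points.
miou, accuracy = mean_iou_and_accuracy([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```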

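DFFs enable 3D semantic segmentation of NeRF scenes from text or image queries, reaching competitive mIoU and accuracy on Replica.
In the evaluated scenes, DFF-based segmentation can outperform the supervised point-cloud model (MinkowskiNet42) in both mIoU and accuracy (0.589 vs. 0.475 mIoU in Table 1).
Query-based prompts are shown to support multi-view-consistent appearance editing, deletion, extraction, and geometric transformation.
Coarse-only training and removing positional encoding (no-PE) yield smoother volumetric decomposition with fewer high-frequency artifacts, at the cost of representing thin structures less faithfully.
Combining DFF with CLIPNeRF enables localized edits that avoid unintended changes to the rest of the scene.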
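
The deletion and blending edits listed above can be illustrated at a single 3D point. This is a hedged sketch only: the function names, the 0.5 threshold, and the density-weighted color mix are assumptions made for illustration, not the paper's exact compositing rule.

```python
def selection_mask(probs, target, threshold=0.5):
    """Return 1.0 if the queried label's probability p(l|x) clears a hypothetical threshold."""
    return 1.0 if probs[target] >= threshold else 0.0

def delete_region(sigma, mask):
    """Remove selected content by zeroing the density where the mask is on."""
    return (1.0 - mask) * sigma

def blend_points(sigma_a, color_a, sigma_b, color_b, mask):
    """Blend two radiance fields at one point: scene A inside the mask, scene B outside.

    The mask may also be soft (between 0 and 1). Colors are mixed in proportion
    to each scene's masked density, which is an assumed compositing rule.
    """
    sa = mask * sigma_a
    sb = (1.0 - mask) * sigma_b
    sigma = sa + sb
    if sigma == 0.0:
        return 0.0, (0.0, 0.0, 0.0)
    color = tuple((sa * ca + sb * cb) / sigma for ca, cb in zip(color_a, color_b))
    return sigma, color

# A point confidently assigned to label 1 is selected, then deleted.
mask = selection_mask([0.2, 0.8], target=1)
edited_sigma = delete_region(3.0, mask)
```

Because the mask is evaluated in 3D rather than per image, the same edit applies to every rendered view, which is what makes the result multi-view consistent.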

This review was generated by AI and reviewed by human editors.