QUICK REVIEW

[Paper Review] Decomposing NeRF for Editing via Feature Field Distillation

Sosuke Kobayashi, Eiichi Matsumoto|arXiv (Cornell University)|May 31, 2022

Advanced Vision and Imaging | Cited by 103

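One-line summary

The paper introduces Distilled Feature Fields (DFFs), which distill 2D image feature encoders into a 3D feature field, enabling zero-shot, query-based semantic decomposition and local editing of NeRFs without re-training.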

ABSTRACT

Emerging neural radiance fields (NeRF) are a promising scene representation for computer graphics, enabling high-quality 3D reconstruction and novel view synthesis from image observations. However, editing a scene represented by a NeRF is challenging, as the underlying connectionist representations such as MLPs or voxel grids are not object-centric or compositional. In particular, it has been difficult to selectively edit specific regions or objects. In this work, we tackle the problem of semantic scene decomposition of NeRFs to enable query-based local editing of the represented 3D scenes. We propose to distill the knowledge of off-the-shelf, self-supervised 2D image feature extractors such as CLIP-LSeg or DINO into a 3D feature field optimized in parallel to the radiance field. Given a user-specified query of various modalities such as text, an image patch, or a point-and-click selection, 3D feature fields semantically decompose 3D space without the need for re-training and enable us to semantically select and edit regions in the radiance field. Our experiments validate that the distilled feature fields (DFFs) can transfer recent progress in 2D vision and language foundation models to 3D scene representations, enabling convincing 3D segmentation and selective editing of emerging neural graphics representations.

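Motivation and objectives

Enable semantic decomposition of NeRFs so that local, object-centric editing is possible without re-training.
Distill a 3D feature field using off-the-shelf 2D feature encoders (e.g., CLIP-LSeg, DINO) as teachers.
Support query-based editing from text, image patches, and other modalities.
Demonstrate improved 3D segmentation and multi-view-consistent editability on real-world NeRF scenes.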

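Proposed method

Extend NeRF with a 3D feature field f(x) alongside the volume density sigma(x) and color c(x,d).
Train f by distillation: features rendered along each ray are matched to the features of a pre-trained image-encoder teacher f_img(I,r) (loss L_f), jointly with the standard photometric loss (L_p).
Compute the 3D segmentation probability p(l|x) from the dot product between f(x) and the query feature f_q(l) of each label in a zero-shot label space (Eq. (5)).
Use blending weights derived from p(l|x) to enable query-based decomposition, so that regions can be selectively blended or edited across multiple NeRFs without re-training.
Demonstrate interaction modes including text, image-patch, and pixel queries and clustering for region selection; optional integration with CLIPNeRF allows enhanced editing.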
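
The distillation and query steps above can be illustrated with a minimal single-ray sketch in plain Python. It is not the authors' implementation: the function names are hypothetical, the L1 distance used for L_f and the softmax normalization of the dot products are assumptions made here for illustration, and a real system would batch this over many rays on a GPU.

```python
import math


def render_weights(sigmas, deltas):
    """Volume-rendering weights w_k = T_k * (1 - exp(-sigma_k * delta_k)),
    where T_k is the transmittance accumulated before sample k."""
    weights = []
    transmittance = 1.0
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def render_feature(weights, point_features):
    """Composite per-sample features f(x_k) along one ray with the same
    weights that would composite color."""
    dim = len(point_features[0])
    return [
        sum(w * f[i] for w, f in zip(weights, point_features))
        for i in range(dim)
    ]


def distillation_loss(rendered, teacher):
    """Feature loss for one ray (assumed here to be an L1 distance)
    between the rendered feature and the teacher feature f_img(I, r)."""
    return sum(abs(a - b) for a, b in zip(rendered, teacher))


def segmentation_probs(point_feature, query_features):
    """p(l|x) from dot products f(x) . f_q(l), normalized with a softmax
    (the normalization is an assumption of this sketch)."""
    logits = [
        sum(a * b for a, b in zip(point_feature, q)) for q in query_features
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Toy ray: empty space first, then two samples inside an object.
    w = render_weights([0.0, 5.0, 5.0], [0.1, 0.1, 0.1])
    rendered = render_feature(w, [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
    print(distillation_loss(rendered, [0.0, 1.0]))
    print(segmentation_probs([0.0, 1.0], [[2.0, 0.0], [0.0, 2.0]]))
```

Because the weights depend only on the density, the same weights composite both color and features, which is what lets the feature field be optimized alongside the radiance field.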

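Experimental results

Research questions

RQ1: Can a 3D feature field distilled from pre-trained 2D vision models achieve open-set, zero-shot semantic decomposition of NeRF scenes?
RQ2: Does query-based segmentation allow specific regions of a NeRF to be edited without re-training?
RQ3: How does distilling 2D features into a 3D field affect novel view synthesis quality and segmentation performance?
RQ4: How do coarse sampling and positional encoding (PE) affect the quality and smoothness of 3D decomposition and editing?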

Key findings

Table 1: 3D semantic segmentation on Replica (mIoU and accuracy)
Metric	Supervised 3D CNN	DFF (coarse)	DFF (fine)
mIoU	0.475	0.589	0.583
Accuracy	0.758	0.855	0.855

Table 2: Novel view synthesis and geometry on Replica
Metric	Value	Value	Value	Value	Value
PSNR	–	32.87	32.85	–	–
SSIM	–	0.934	0.932	–	–
LPIPS	–	0.148	0.150	–	–
delta < 1.25	–	0.993	0.993	–	–
absrel	–	0.018	0.017	–	–

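DFFs enable 3D semantic segmentation of NeRF scenes from text and image queries, achieving competitive mIoU and accuracy on Replica.
On the evaluated scenes, DFF-based segmentation can outperform a supervised point-cloud model (MinkowskiNet42) in mIoU and accuracy.
Editability is demonstrated through query-guided, multi-view-consistent appearance editing, deletion, extraction, and geometric transformation.
Coarse training and removing positional encoding (no-PE) smooth the volumetric decomposition and reduce high-frequency artifacts, at the cost of fine detail.
Combining DFF with CLIPNeRF enables local edits that do not affect other parts of the scene.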
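
For readers unfamiliar with the two segmentation metrics in Table 1, the following is a minimal sketch of how mIoU and accuracy are commonly computed from per-point label predictions. The function name is hypothetical, and skipping classes that appear in neither the prediction nor the ground truth is one common convention assumed here; the paper's exact evaluation protocol may differ.

```python
def miou_and_accuracy(pred, target, num_classes):
    """Return (mean IoU, accuracy) for two equal-length label sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")

    pairs = list(zip(pred, target))
    accuracy = sum(p == t for p, t in pairs) / len(pairs)

    ious = []
    for c in range(num_classes):
        intersection = sum(p == c and t == c for p, t in pairs)
        union = sum(p == c or t == c for p, t in pairs)
        if union > 0:  # ignore classes absent from both pred and target
            ious.append(intersection / union)

    if not ious:
        raise ValueError("no class in range(num_classes) occurs in the data")
    return sum(ious) / len(ious), accuracy


if __name__ == "__main__":
    # Four points, two classes; one point of class 1 is mislabeled as 0.
    print(miou_and_accuracy([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

mIoU penalizes errors on rare classes more strongly than accuracy does, which is why the two numbers can rank methods differently.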


This review was generated by AI and checked by a human editor.