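[Paper Review] ShaRF: Shape-conditioned Radiance Fields from a Single View
This paper proposes ShaRF, a two-stage neural rendering framework that disentangles shape and appearance. It conditions a radiance field on a voxelized shape scaffold to reconstruct objects and synthesize novel views from a single image, and it generalizes to more realistic renderings and real photographs.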
We present a method for estimating neural scene representations of objects given only a single image. The core of our method is the estimation of a geometric scaffold for the object and its use as a guide for the reconstruction of the underlying radiance field. Our formulation is based on a generative process that first maps a latent code to a voxelized shape, and then renders it to an image, with the object appearance being controlled by a second latent code. During inference, we optimize both the latent codes and the networks to fit a test image of a new object. The explicit disentanglement of shape and appearance allows our model to be fine-tuned given a single image. We can then render new views in a geometrically consistent manner, and they faithfully represent the input object. Additionally, our method is able to generalize to images outside of the training domain (more realistic renderings and even real photographs). Finally, the inferred geometric scaffold is itself an accurate estimate of the object's 3D shape. We demonstrate in several experiments the effectiveness of our approach on both synthetic and real images.
Motivation and Objectives
- Estimate neural scene representations of objects from a single image by building a geometric voxel scaffold to guide radiance field reconstruction.
- Disentangle shape and appearance to enable robust fine-tuning and better generalization across domains.
- Render geometrically consistent novel views and recover accurate 3D shape from minimal input.
- Demonstrate generalization to more realistic renderings and real photographs beyond training domains.
- Provide an optimization-based inference procedure that jointly refines latent codes and networks on a test image.
Proposed Method
- A shape network G maps a latent code to a 3D voxel grid V representing object occupancy.
- An appearance network F estimates a radiance field conditioned on V via the per-point occupancy α_p and an appearance latent code φ, producing color c and density σ for any 3D point p and view direction d.
- Radiance field rendering follows volume rendering as in NeRF, with ray casting and accumulation to synthesize pixels.
- Training uses ShapeNet objects with latent codes θ (shape) and φ (appearance), plus three losses: a voxel-wise binary cross-entropy (BCE) loss on occupancy, a symmetry loss, and a projection loss against object silhouettes from two views.
- Inference optimizes θ, φ and refines G and F to match a test image, in a two-stage process: Stage 1 optimizes θ, G, φ with F fixed; Stage 2 optimizes φ and F with θ and G fixed, enabling fine-tuning for real images.
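The volume rendering step in the list above can be illustrated with a minimal sketch of the standard NeRF quadrature rule for a single ray. This is not the authors' implementation: the function name `render_ray` and its list-based inputs are assumptions made for illustration. In ShaRF, the densities and colors would come from the appearance network F evaluated at each sample point; here they are passed in as plain numbers.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray with the NeRF quadrature rule.

    densities: non-negative density sigma_i at each sample (hypothetical input;
               in ShaRF these would be predicted by the appearance network F)
    colors:    (r, g, b) tuple at each sample
    deltas:    distance between each sample and the next one
    Returns the composited (r, g, b) and the accumulated opacity of the ray.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed so far
    rgb = [0.0, 0.0, 0.0]
    opacity = 0.0
    for sigma, color, delta in zip(densities, colors, deltas):
        # Probability that the ray terminates inside this segment.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for k in range(3):
            rgb[k] += weight * color[k]
        opacity += weight
        # Light surviving past this segment.
        transmittance *= 1.0 - alpha
    return tuple(rgb), opacity


# Two samples, each with alpha = 0.5: a red sample followed by a blue one.
sigma = math.log(2.0) / 0.5
pixel, acc = render_ray([sigma, sigma], [(1, 0, 0), (0, 0, 1)], [0.5, 0.5])
print(pixel, acc)
```

In the usage example, the first sample receives a weight of about 0.5 and the second about 0.25, because only half of the light survives past the first sample. Samples with zero density contribute nothing, which is why a reliable occupancy scaffold can help concentrate appearance modeling near the object surface.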
Experimental Results
Research Questions
- RQ1: Can a latent, shape-conditioned radiance field learned from single-view images render accurate novel views of unseen objects?
- RQ2: Does disentangling geometry and appearance improve generalization to realistic renderings and real photos?
- RQ3: How does jointly inferring and fine-tuning shape and appearance networks on a single test image compare to only optimizing latent codes?
- RQ4: Can a voxelized geometric scaffold guide surface-focused appearance synthesis to improve rendering fidelity?
- RQ5: What is the performance of ShaRF variants against existing single-image NeRF-based methods across synthetic and real datasets?
Key Findings
| Variant | PSNR (code-only) | SSIM (code-only) | PSNR (code+network) | SSIM (code+network) |
|---|---|---|---|---|
| V1. Conditional NeRF | 22.12 | 0.90 | 22.05 | 0.91 |
| V2. ShapeFromNR | 23.37 | 0.92 | 23.31 | 0.92 |
| V3. ShapeFromMask | 22.94 | 0.91 | 22.98 | 0.91 |
| V4. ShapeFromGT | 25.59 | 0.94 | 25.65 | 0.94 |
- ShaRF variants with a shape scaffold outperform a code-only baseline on ShapeNet-SRN chairs and cars in PSNR/SSIM, with V2 achieving 23.31–23.37 PSNR and SSIM 0.92 on chairs.
- On ShapeNet-Realistic, shape-scaffold variants (V3, V4) surpass the code-only variant, with V4 reaching 25.65 PSNR and SSIM 0.94.
- On Pix3D, ShapeFromMask (shape scaffold from segmentation) with code+network optimization yields strong rendering quality and competitive results against pixelNeRF.
- ShapeFromNR and ShapeFromMask variants demonstrate better generalization to more realistic renderings and real images than Conditional NeRF alone.
- The two-stage inference procedure (first refine shape and its network, then refine appearance and renderer) significantly improves reconstruction quality, especially for non-training-domain inputs.
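For readers less familiar with the PSNR figures reported above, the following is a minimal sketch of how PSNR is commonly computed from the mean squared error between a rendered and a reference image. The function name `psnr`, the flat-list image layout, and the [0, 1] value range are assumptions made for illustration; the review does not state the paper's exact evaluation protocol.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    pred, target: flat sequences of pixel values (assumed to lie in [0, max_value])
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("images must be non-empty and of equal size")
    # Mean squared error over all pixel values.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
print(psnr([0.0, 0.0, 0.0], [0.1, 0.1, 0.1]))
```

Because PSNR is logarithmic, a uniform error of 0.1 corresponds to roughly 20 dB, and each additional 3 dB corresponds to approximately halving the mean squared error.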
This review was generated by AI and checked by a human editor.