QUICK REVIEW

[Paper Review] MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images

Yuedong Chen, Haofei Xu|arXiv (Cornell University)|Mar 21, 2024

Medical Image Segmentation Techniques | Citations: 5

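One-Line Summary

MVSplat introduces a cost-volume-based, feed-forward 3D Gaussian splatting model that learns Gaussian centers and parameters from sparse multi-view images, achieving state-of-the-art rendering quality with far fewer parameters and faster inference than prior methods.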

ABSTRACT

We introduce MVSplat, an efficient model that, given sparse multi-view images as input, predicts clean feed-forward 3D Gaussians. To accurately localize the Gaussian centers, we build a cost volume representation via plane sweeping, where the cross-view feature similarities stored in the cost volume can provide valuable geometry cues to the estimation of depth. We also learn other Gaussian primitives' parameters jointly with the Gaussian centers while only relying on photometric supervision. We demonstrate the importance of the cost volume representation in learning feed-forward Gaussians via extensive experimental evaluations. On the large-scale RealEstate10K and ACID benchmarks, MVSplat achieves state-of-the-art performance with the fastest feed-forward inference speed (22 fps). More impressively, compared to the latest state-of-the-art method pixelSplat, MVSplat uses $10\times$ fewer parameters and infers more than $2\times$ faster while providing higher appearance and geometry quality as well as better cross-dataset generalization.

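Motivation and Objectives

Address efficient 3D scene reconstruction and novel view synthesis from very sparse multi-view inputs.
Leverage the 3D Gaussian splatting representation for fast, differentiable rendering.
Introduce a cost-volume-based depth and geometry learning module to localize the Gaussian centers.
Predict the Gaussian parameters (opacity, covariance, color) jointly with the centers under photometric supervision.
Demonstrate strong cross-dataset generalization and efficiency compared with prior methods.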

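Proposed Method

Construct a cost volume by plane sweeping in 3D space, capturing cross-view feature similarities across depth candidates.
Extract and fuse features across views with a multi-view Transformer, enabling view-consistent depth prediction.
Refine the cost volume with a lightweight 2D U-Net with cross-view attention to handle textureless regions.
Unproject the depth maps to obtain 3D Gaussian centers, and predict opacity, covariance (scale and rotation), and color (spherical harmonics).
Render novel views with differentiable 3D Gaussian splatting using the predicted parameters.
Train end-to-end with a photometric loss between the rendered images and the ground-truth images.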
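The plane-sweep step above, together with a depth readout, can be sketched in NumPy. This is a minimal illustration, not the authors' implementation. It assumes a single source view, shared intrinsics `K`, nearest-neighbor sampling rather than bilinear interpolation, and a softmax-weighted average over depth candidates to read out depth (a common multi-view stereo choice that this review does not spell out). All function and variable names are hypothetical.

```python
import numpy as np

def plane_sweep_cost_volume(feat_ref, feat_src, K, R, t, depths):
    """Correlate reference features with source features warped at each depth.

    feat_ref, feat_src: (C, H, W) feature maps.
    K: (3, 3) intrinsics shared by both views (assumption).
    R, t: map reference-camera points into the source camera, x_src = R @ x_ref + t.
    depths: (D,) candidate depths. Returns a (D, H, W) cost volume.
    """
    C, H, W = feat_ref.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    rays = np.linalg.inv(K) @ pix            # (3, H*W), z component is 1
    ref = feat_ref.reshape(C, -1)
    cost = np.zeros((len(depths), H * W))
    for i, d in enumerate(depths):
        proj = K @ (R @ (rays * d) + t[:, None])   # project the swept plane
        z = proj[2]
        safe_z = np.where(z > 1e-6, z, 1.0)
        us = np.rint(proj[0] / safe_z).astype(int)
        vs = np.rint(proj[1] / safe_z).astype(int)
        valid = (z > 1e-6) & (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
        warped = np.zeros_like(ref)                # out-of-view pixels stay zero
        warped[:, valid] = feat_src[:, vs[valid], us[valid]]
        cost[i] = (ref * warped).sum(axis=0) / np.sqrt(C)  # scaled dot product
    return cost.reshape(len(depths), H, W)

def expected_depth(cost, depths):
    """Softmax over the depth axis, then a probability-weighted mean depth."""
    logits = cost - cost.max(axis=0, keepdims=True)
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    return (prob * depths[:, None, None]).sum(axis=0)

# Toy scene: a fronto-parallel plane at depth 2 seen by two cameras one unit apart.
rng = np.random.default_rng(0)
C, H, W, f = 64, 8, 32, 8.0
K = np.array([[f, 0.0, 0.0], [0.0, f, 0.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([-1.0, 0.0, 0.0])
depths = np.array([1.0, 2.0, 4.0, 8.0])
feat_src = rng.standard_normal((C, H, W))
feat_ref = np.zeros_like(feat_src)
feat_ref[:, :, 4:] = feat_src[:, :, :-4]   # depth 2 gives a 4-pixel shift (f * 1 / 2)

cost = plane_sweep_cost_volume(feat_ref, feat_src, K, R, t, depths)
depth = expected_depth(cost, depths)
print(np.median(depth[:, 4:]))             # close to the true depth of 2
```

In the toy scene the similarity peaks at the candidate whose induced pixel shift matches the true one, which is the geometric cue the cost volume is meant to provide. The four leftmost columns have no match in the source view, so they are excluded from the check.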

Figure 2: Overview of MVSplat. Given multiple posed images as input, we first extract multi-view image features with a multi-view Transformer, which contains self- and cross-attention layers to exchange information across views. Next, we construct per-view cost volumes using plane sweeping. The Tr
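The unprojection step in the method list, and the scale-and-rotation form of the covariance, can be illustrated with a short sketch. It assumes a pinhole camera with depth measured along the camera z-axis, a 4x4 camera-to-world matrix, and the standard 3D Gaussian splatting factorization $\Sigma = R S S^\top R^\top$ with a unit quaternion in (w, x, y, z) order. In MVSplat the scales, rotations, opacities, and colors are predicted by the network; here they are plain inputs, and all names are hypothetical.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) depth map to (H*W, 3) world-space points (Gaussian centers)."""
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)    # scale each ray by its depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t

def covariance_from_scale_rotation(scale, quat):
    """Build Sigma = R S S^T R^T from per-axis scales (3,) and a quaternion (w, x, y, z)."""
    w, x, y, z = quat / np.linalg.norm(quat)
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])
    M = R @ np.diag(scale)
    return M @ M.T                           # symmetric positive semi-definite

# A 3x3 depth map at constant depth 2, camera shifted one unit along world z.
K = np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]])
cam_to_world = np.eye(4)
cam_to_world[:3, 3] = [0.0, 0.0, 1.0]
centers = unproject_depth(np.full((3, 3), 2.0), K, cam_to_world)
cov = covariance_from_scale_rotation(np.array([1.0, 2.0, 3.0]),
                                     np.array([1.0, 0.0, 0.0, 0.0]))
print(centers[4], np.diag(cov))
```

Parameterizing the covariance through a scale vector and a rotation keeps it a valid covariance matrix by construction, which is why the method predicts those two quantities rather than the matrix entries directly.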

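Experimental Results

Research Questions

RQ1: Can a cost-volume-driven feed-forward model improve geometry and appearance quality in scene-scale reconstruction from sparse views?
RQ2: Does jointly learning the 3D Gaussian centers and parameters through cross-view cost volumes improve cross-view consistency and generalization?
RQ3: How does MVSplat compare with recent methods in accuracy, speed, and parameter efficiency on RealEstate10K, ACID, and DTU?
RQ4: How do components such as the cost volume, cross-view attention, and the refinement U-Net affect final performance?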

Key Findings

Method	Time (s)	Params (M)	RealEstate10K PSNR	RealEstate10K SSIM	RealEstate10K LPIPS	ACID PSNR	ACID SSIM	ACID LPIPS
pixelNeRF	5.299	28.2	20.43	0.589	0.550	20.97	0.547	0.533
GPNR	13.340	9.6	26.10	0.858	0.143	25.28	0.764	0.332
AttnRend	1.325	125.1	24.78	0.820	0.213	26.88	0.799	0.218
MuRF	0.186	5.3	26.10	0.858	0.143	28.09	0.841	0.155
pixelSplat	0.104	125.4	25.89	0.858	0.142	28.14	0.839	0.150
MVSplat	0.044	12.0	26.39	0.869	0.128	28.25	0.843	0.144
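For reading the PSNR columns: PSNR is a log-scale function of the mean squared error between a rendered image and its ground truth. The sketch below uses the standard definition and assumes images flattened to lists of values in [0, 1]. The function names are illustrative and not taken from the MVSplat code.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the target."""
    err = mse(a, b)
    return math.inf if err == 0 else 10.0 * math.log10(max_val ** 2 / err)

rendered = [0.5, 0.5, 0.5, 0.5]
target = [0.6, 0.4, 0.6, 0.4]
print(psnr(rendered, target))   # about 20 dB, since the MSE is 0.01
```

Because the scale is logarithmic, a difference of 3 dB corresponds to roughly a factor of two in mean squared error.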

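Achieves state-of-the-art rendering quality on every metric (PSNR, SSIM, LPIPS) on the RealEstate10K and ACID benchmarks.
Runs feed-forward inference at 22 fps with about 12 million parameters, ahead of pixelSplat in both speed and efficiency.
Uses roughly one tenth of pixelSplat's parameters and infers more than twice as fast, with better appearance and geometry quality.
The cost-volume-based encoder is critical: removing it degrades performance substantially (for example, PSNR drops by more than 3 dB).
Cross-view attention and the 2D U-Net refinement markedly improve geometry and help in challenging regions.
Zero-shot cross-dataset generalization is stronger for MVSplat than for pixelSplat, with better LPIPS and greater robustness to domain shift.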
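The efficiency claims above can be checked directly against the time and parameter columns of the table. The snippet below only copies those four numbers for pixelSplat and MVSplat and derives the ratios; the dictionary layout and variable names are illustrative.

```python
# Inference time (seconds) and parameter count (millions), copied from the table.
table = {
    "pixelSplat": {"time_s": 0.104, "params_m": 125.4},
    "MVSplat": {"time_s": 0.044, "params_m": 12.0},
}

mv, ps = table["MVSplat"], table["pixelSplat"]
mvsplat_fps = 1.0 / mv["time_s"]                 # frames per second
speedup = ps["time_s"] / mv["time_s"]            # how many times faster
param_ratio = ps["params_m"] / mv["params_m"]    # how many times fewer parameters

print(f"{mvsplat_fps:.1f} fps, {speedup:.2f}x faster, {param_ratio:.2f}x fewer parameters")
```

The derived values (about 22.7 fps, a 2.36x speedup, and a 10.45x parameter reduction) agree with the rounded figures of 22 fps, more than 2x, and 10x quoted in the abstract.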

Figure 3: Comparisons with the state of the art. The first three rows are from RealEstate10K (indoor scenes), while the last one is from ACID (outdoor scenes). Models are trained with a collection of training scenes from each indicated dataset, and tested on novel scenes from the same dataset. MVS

This review was generated by AI and checked by a human editor.