QUICK REVIEW

[Paper Review] Deep 3D Pan via Local adaptive "t-shaped" convolutions with global and local adaptive dilations

Juan Luis Gonzalez Bello, Munchurl Kim|arXiv (Cornell University)|Apr 30, 2020

Advanced Vision and Imaging|References: 39|Citations: 3

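One-Sentence Summary

This paper proposes Monster-Net, a deep learning architecture built on t-shaped adaptive convolutions with globally and locally adaptive dilations, to synthesize high-quality 3D panned views (Deep 3D Pan) from a single 2D input image. By modeling both the global camera shift and the local 3D geometry from that single image, the method achieves state-of-the-art performance in view synthesis and unsupervised monocular depth estimation.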

ABSTRACT

Recent advances in deep learning have shown promising results in many low-level vision tasks. However, solving the single-image-based view synthesis is still an open problem. In particular, the generation of new images at parallel camera views given a single input image is of great interest, as it enables 3D visualization of the 2D input scenery. We propose a novel network architecture to perform stereoscopic view synthesis at arbitrary camera positions along the X-axis, or Deep 3D Pan, with adaptive kernels equipped with globally and locally adaptive dilations. Our proposed network architecture, the monster-net, is devised with a novel t-shaped adaptive kernel with globally and locally adaptive dilation, which can efficiently incorporate global camera shift into and handle local 3D geometries of the target image's pixels for the synthesis of naturally looking 3D panned views when a 2-D input image is given. Extensive experiments were performed on the KITTI, CityScapes and our VXXLXX_STEREO indoors dataset to prove the efficacy of our method. Our monster-net significantly outperforms the state-of-the-art method, SOTA, by a large margin in all metrics of RMSE, PSNR, and SSIM. Our proposed monster-net is capable of reconstructing more reliable image structures in synthesized images with coherent geometry. Moreover, the disparity information that can be extracted from the kernel is much more reliable than that of the SOTA for the unsupervised monocular depth estimation task, confirming the effectiveness of our method.

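Motivation and Objectives

Address the challenge of synthesizing realistic, high-quality 3D panned views from a single 2D input image.
Improve the modeling of both the global camera shift and the local 3D geometry in single-image view synthesis.
Increase the reliability of disparity estimation for unsupervised monocular depth prediction.
Surpass existing state-of-the-art methods in view synthesis quality and geometric consistency.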

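Proposed Method

The method introduces a novel t-shaped adaptive kernel whose receptive field is adjusted dynamically through global and local adaptive dilation rates.
The global adaptive dilation incorporates the overall camera shift along the X-axis into the network's feature learning process.
The local adaptive dilation enables fine-grained modeling of the local 3D geometry around each pixel in the target view.
Monster-Net, the network architecture that integrates these adaptive convolutions, synthesizes high-definition stereoscopic views at arbitrary camera positions.
The adaptive dilation mechanism is learned end-to-end during training, so the network can automatically calibrate the kernel dilation to the input content.
Generating both the synthesized view and the disparity map from the same feature maps improves consistency and reliability.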
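The sampling idea described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. This review does not specify the exact kernel layout or how the two dilations are combined, so everything here is an assumption made for illustration: the one-sided horizontal arm plus centered vertical arm, the additive combination `global_dilation + local_dilation[y][x]`, and the names `t_shaped_offsets`, `sample_bilinear`, and `t_shaped_adaptive_conv`. In a real network the per-pixel weights and local dilations would be predicted by learned layers, whereas here they are plain inputs.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image stored as rows of floats


def t_shaped_offsets(h_taps: int, v_taps: int, dilation: float,
                     direction: int = 1) -> List[Tuple[float, float]]:
    """Return (dy, dx) offsets for a hypothetical t-shaped sampling pattern.

    Assumed layout: a one-sided horizontal arm that starts at the center tap
    and extends toward `direction` (+1 right, -1 left), plus a vertical arm
    centered on the same tap. `v_taps` is expected to be odd.
    """
    horizontal = [(0.0, direction * k * dilation) for k in range(h_taps)]
    half_v = v_taps // 2
    vertical = [(k * dilation, 0.0)
                for k in range(-half_v, half_v + 1) if k != 0]
    return horizontal + vertical


def sample_bilinear(img: Image, y: float, x: float) -> float:
    """Bilinearly interpolate img at (y, x), clamping coordinates to the border."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = img[y0][x0] * (1.0 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1.0 - fx) + img[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy


def t_shaped_adaptive_conv(img: Image,
                           weights: List[List[List[float]]],
                           local_dilation: Image,
                           global_dilation: float,
                           h_taps: int = 3,
                           v_taps: int = 3,
                           direction: int = 1) -> Image:
    """Apply per-pixel tap weights over the dilated t-shaped pattern.

    weights[y][x] holds one weight per tap (h_taps + v_taps - 1 values).
    Assumption: global and local dilation are combined by simple addition.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dilation = global_dilation + local_dilation[y][x]
            offsets = t_shaped_offsets(h_taps, v_taps, dilation, direction)
            out[y][x] = sum(
                wt * sample_bilinear(img, y + dy, x + dx)
                for wt, (dy, dx) in zip(weights[y][x], offsets)
            )
    return out
```

In this sketch, a larger `global_dilation` spreads every pixel's taps further along the horizontal arm, loosely mirroring a larger camera shift, while `local_dilation` lets individual pixels reach nearer or farther. Bilinear sampling is used so that fractional dilations remain well defined.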

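Experimental Results

Research Questions

RQ1: Can adaptive dilation improve the quality and geometric consistency of 3D panned views synthesized from a single 2D image?
RQ2: How does integrating the global camera shift and the local 3D geometry affect view synthesis performance?
RQ3: Can the proposed network outperform state-of-the-art methods?
RQ4: To what extent does the t-shaped adaptive kernel improve feature representation in view synthesis tasks?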

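Key Findings

Monster-Net significantly outperforms the state-of-the-art method on all three metrics (RMSE, PSNR, and SSIM) on the KITTI, CityScapes, and VXXLXX_STEREO datasets.
Compared with baseline methods, the synthesized images show more reliable image structures and coherent 3D geometry.
The disparity maps predicted by Monster-Net were found to be more accurate and consistent than those of the SOTA method, supporting its improved depth estimation capability.
Validation on multiple benchmark datasets showed strong generalization across diverse scenes, including indoor and outdoor environments.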
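For readers unfamiliar with the metrics named above, the following minimal sketch shows the standard definitions of RMSE and PSNR for a synthesized view compared against a ground-truth view. The review does not describe the paper's evaluation protocol, so the grayscale list-of-rows image format, the default peak value of 255, and the function names are assumptions. SSIM is omitted because it needs windowed local statistics that would not fit in a short example.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of floats


def mse(pred: Image, target: Image) -> float:
    """Mean squared error over all pixels of two same-sized images."""
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            total += (p - t) ** 2
            count += 1
    return total / count


def rmse(pred: Image, target: Image) -> float:
    """Root mean squared error (lower is better)."""
    return math.sqrt(mse(pred, target))


def psnr(pred: Image, target: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB (higher is better).

    max_value is the assumed peak pixel intensity; identical images
    yield infinity because the error term is zero.
    """
    err = mse(pred, target)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)
```

The two metrics move in opposite directions: a better synthesis lowers RMSE and raises PSNR, which is worth keeping in mind when reading a claim of improvement on all metrics.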


This review was generated by AI and checked by a human editor.