QUICK REVIEW

[Paper Review] TransformerFusion: Monocular RGB Scene Reconstruction using Transformers

Aljaž Božič, Pablo Palafox|arXiv (Cornell University)|Jul 5, 2021

Advanced Vision and Imaging | References: 45 | Cited by: 31

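One-sentence summary

TransformerFusion uses transformer-based multi-view feature fusion to reconstruct 3D scenes from monocular RGB video, online and in a coarse-to-fine fashion, achieving state-of-the-art results.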

ABSTRACT

We introduce TransformerFusion, a transformer-based 3D scene reconstruction approach. From an input monocular RGB video, the video frames are processed by a transformer network that fuses the observations into a volumetric feature grid representing the scene; this feature grid is then decoded into an implicit 3D scene representation. Key to our approach is the transformer architecture that enables the network to learn to attend to the most relevant image frames for each 3D location in the scene, supervised only by the scene reconstruction task. Features are fused in a coarse-to-fine fashion, storing fine-level features only where needed, requiring lower memory storage and enabling fusion at interactive rates. The feature grid is then decoded to a higher-resolution scene reconstruction, using an MLP-based surface occupancy prediction from interpolated coarse-to-fine 3D features. Our approach results in an accurate surface reconstruction, outperforming state-of-the-art multi-view stereo depth estimation methods, fully-convolutional 3D reconstruction approaches, and approaches using LSTM- or GRU-based recurrent networks for video sequence fusion.

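Motivation and goals

Monocular 3D scene reconstruction from RGB video, motivated by interactive applications.
Propose a transformer-based fusion mechanism that attends to the most informative frames at each 3D location.
Enable online, interactive reconstruction through coarse-to-fine feature fusion and selective view maintenance.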

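Proposed method

A 2D CNN encodes each input frame, producing coarse and fine image features.
The 2D features are projected into 3D grids in world space at coarse and fine resolutions.
Two transformer networks fuse the coarse and fine grid features over time, producing psi^c and psi^f.
3D CNN refinement is applied to the coarse and fine grids, and near-surface occupancy masks (coarse and fine) are predicted for efficient filtering.
The coarse and fine features are interpolated and decoded by an MLP into an occupancy value o for surface reconstruction; the mesh is extracted with Marching Cubes.
The model is trained end to end with BCE losses on the near-surface masks and the surface occupancy; occlusion-aware ground-truth samples are drawn from ScanNet.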
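
The temporal fusion step can be illustrated with a minimal sketch. It assumes a single head of scaled dot-product attention in which the per-view features of one 3D location act as both keys and values, and a hypothetical query vector stands in for what the learned network would supply. The paper's transformers are learned, multi-layer networks, so this is only a hand-written stand-in for the weighting idea, and the names `fuse_views`, `query`, and `view_features` are illustrative, not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(query, view_features):
    """Fuse the per-view features of one 3D location (assumed setup).

    query:         hypothetical query vector of length d
    view_features: list of feature vectors (length d), one per frame
    Returns the attention-weighted feature and the per-view weights.
    """
    d = len(query)
    # Scaled dot product between the query and each view's feature.
    scores = [
        sum(q * k for q, k in zip(query, feat)) / math.sqrt(d)
        for feat in view_features
    ]
    weights = softmax(scores)
    # Weighted sum of the view features, dimension by dimension.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, view_features))
        for i in range(d)
    ]
    return fused, weights


# Two toy views; the first aligns with the query and gets the larger weight.
fused, weights = fuse_views([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The returned weights are the quantity a view-selection scheme could inspect to decide which frames matter most for a location.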

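Experimental results

Research questions

RQ1 Can transformer-based multi-view feature fusion surpass prior multi-view depth estimation and 3D surface prediction methods in monocular 3D reconstruction quality?
RQ2 Do coarse-to-fine fusion and online view selection enable reconstruction at interactive rates while maintaining accuracy?
RQ3 How effective is the learned view attention at selecting informative frames for each 3D location?
RQ4 How do spatial refinement and near-surface masks affect reconstruction quality and runtime?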

Main findings

Method	Acc ↓	Compl ↓	Chamfer ↓	Prec ↑	Recall ↑	F-score ↑
RevisitingSI	14.29	16.19	15.24	0.346	0.293	0.314
MVDepthNet	12.94	8.34	10.64	0.443	0.487	0.460
GPMVS	12.90	8.02	10.46	0.453	0.510	0.477
ESTDepth	12.71	7.54	10.12	0.456	0.542	0.491
DPSNet	11.94	7.58	9.77	0.474	0.519	0.492
DELTAS	11.95	7.46	9.71	0.478	0.533	0.501
DeepVideoMVS	10.68	6.90	8.79	0.541	0.592	0.563
COLMAP	10.22	11.88	11.05	0.509	0.474	0.489
NeuralRecon	5.09	9.13	7.11	0.630	0.612	0.619
Atlas	7.16	7.61	7.38	0.675	0.605	0.636
Ours w/o TRSF avg	7.23	9.74	8.48	0.635	0.501	0.557
Ours w/o TRSF pred	6.11	11.12	8.61	0.686	0.512	0.583
Ours w/o spatial ref.	10.46	16.91	13.68	0.479	0.295	0.361
Ours 4 images, RND	8.01	10.28	9.15	0.587	0.445	0.502
Ours 4 images	6.80	8.40	7.60	0.661	0.524	0.581
Ours 8 images, RND	6.74	8.55	7.64	0.665	0.544	0.596
Ours 8 images	6.17	7.69	6.93	0.704	0.584	0.636
Ours 16 images, RND	5.80	8.56	7.18	0.711	0.584	0.638
Ours w/o C2F filter	6.57	7.69	7.13	0.678	0.592	0.631
Ours	5.52	8.27	6.89	0.728	0.600	0.655
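
For readers unfamiliar with the table's columns, the sketch below computes them for two small point sets using the definitions commonly used for these metrics: accuracy is the mean distance from predicted points to their nearest ground-truth point, completeness is the reverse, Chamfer is their average, and precision and recall are the fractions of points within a distance threshold. The threshold value of 0.05 and the brute-force nearest-neighbour search are assumptions made for illustration; this is not the paper's evaluation code.

```python
import math


def nearest_dist(point, cloud):
    """Distance from one point to its nearest neighbour in a cloud."""
    return min(math.dist(point, other) for other in cloud)


def reconstruction_metrics(pred, gt, threshold=0.05):
    """Compute the table's metrics for two non-empty point lists.

    threshold is an assumed inlier distance, in the same unit as the points.
    """
    d_pred = [nearest_dist(p, gt) for p in pred]  # prediction -> ground truth
    d_gt = [nearest_dist(g, pred) for g in gt]    # ground truth -> prediction
    acc = sum(d_pred) / len(d_pred)
    compl = sum(d_gt) / len(d_gt)
    prec = sum(d < threshold for d in d_pred) / len(d_pred)
    recall = sum(d < threshold for d in d_gt) / len(d_gt)
    f_score = 2 * prec * recall / (prec + recall) if prec + recall > 0 else 0.0
    return {
        "acc": acc,
        "compl": compl,
        "chamfer": (acc + compl) / 2,
        "prec": prec,
        "recall": recall,
        "f_score": f_score,
    }


# Toy example: one predicted point is close to the surface, one is not.
pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_points = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.2)]
metrics = reconstruction_metrics(pred_points, gt_points)
```

The Chamfer column in the table is consistent with this averaging: for the "Ours" row, (5.52 + 8.27) / 2 = 6.895, which matches the reported 6.89 up to rounding.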

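Outperforms state-of-the-art methods on ScanNet in Chamfer distance (6.89 vs. 7.11 for NeuralRecon) and F-score (0.655 vs. 0.636 for Atlas).
Transformer-based view fusion clearly outperforms the non-transformer fusion baselines (F-score 0.655 vs. 0.557 for "Ours w/o TRSF avg" and 0.583 for "Ours w/o TRSF pred").
Coarse-to-fine refinement and near-surface masks improve quality (F-score falls to 0.361 without spatial refinement and to 0.631 without the coarse-to-fine filter) and enable online performance of about 7 FPS.
Attention-based frame selection beats random selection at the same view budget (F-score 0.581 vs. 0.502 with 4 images, 0.636 vs. 0.596 with 8 images), so fewer views per location are needed for a given accuracy.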
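
The frame-selection idea can be sketched as a bounded per-location view list. The sketch assumes each kept view carries an attention weight that has already been computed, and that the view with the lowest weight is evicted when the budget is exceeded. The function name, the tuple layout, and the budget of 3 in the usage lines are hypothetical choices for illustration, not details confirmed by the review.

```python
def update_views(kept, new_view, max_views):
    """Add a view for one 3D location, keeping at most max_views entries.

    kept:     list of (frame_id, attention_weight) tuples
    new_view: a single (frame_id, attention_weight) tuple
    When over budget, the entry with the lowest weight is dropped.
    """
    candidates = kept + [new_view]
    if len(candidates) <= max_views:
        return candidates
    # Index of the least-attended view among all candidates.
    weakest = min(range(len(candidates)), key=lambda i: candidates[i][1])
    return candidates[:weakest] + candidates[weakest + 1:]


# Hypothetical budget of 3 views; frame "f1" has the lowest weight.
kept_views = [("f0", 0.5), ("f1", 0.1), ("f2", 0.3)]
kept_views = update_views(kept_views, ("f3", 0.2), max_views=3)
```

Bounding the list this way keeps per-location memory constant as the video grows, which is what makes an online setting practical.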

This review was generated by AI and checked by human editors.