QUICK REVIEW

[Paper Review] SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition

Zhixuan Lin, Yifu Wu|arXiv (Cornell University)|Jan 8, 2020

Advanced Image and Video Retrieval Techniques | References: 25 | Cited by: 44

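One-Sentence Summary

SPACE unifies spatial attention and scene mixture in a single probabilistic model that jointly decomposes foreground objects and complex backgrounds, using parallel foreground processing for scalable unsupervised object-centric scene representation, and is evaluated against SPAIR, IODINE, and GENESIS on Atari and 3D-Rooms.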

ABSTRACT

The ability to decompose complex multi-object scenes into meaningful abstractions like objects is fundamental to achieve higher-level cognition. Previous approaches for unsupervised object-oriented scene representation learning are either based on spatial-attention or scene-mixture approaches and limited in scalability which is a main obstacle towards modeling real-world scenes. In this paper, we propose a generative latent variable model, called SPACE, that provides a unified probabilistic modeling framework that combines the best of spatial-attention and scene-mixture approaches. SPACE can explicitly provide factorized object representations for foreground objects while also decomposing background segments of complex morphology. Previous models are good at either of these, but not both. SPACE also resolves the scalability problems of previous methods by incorporating parallel spatial-attention and thus is applicable to scenes with a large number of objects without performance degradations. We show through experiments on Atari and 3D-Rooms that SPACE achieves the above properties consistently in comparison to SPAIR, IODINE, and GENESIS. Results of our experiments can be found on our project website: https://sites.google.com/view/space-project-page

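Motivation and Goals

Motivate unsupervised learning of structured scene representations for multi-object scenes with occlusion and complex backgrounds.
Propose SPACE, which unifies the spatial-attention and scene-mixture approaches within one probabilistic latent variable framework.
Address scalability by processing foreground objects in parallel while keeping object representations disentangled.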

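Proposed Method

Introduce a foreground module with parallel spatial attention that produces z_where, z_depth, z_pres, and z_what at every grid cell.
Use a Spatial Transformer to render all foreground objects onto the canvas in parallel.
Model the background as a K-component pixel-wise mixture; each component has latents z^m (mixing) and z^c (color) that are decoded by a VAE.
Train with a variational objective (ELBO) that covers foreground and background jointly, using a mean-field approximation for the per-cell latents.
Prevent box-splitting with an auxiliary boundary loss that penalizes object masks touching the glimpse boundary.
Demonstrate scalability through parallel foreground processing, in contrast to the sequential inference of SPAIR, IODINE, and GENESIS.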
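
The way the foreground canvas and the K-component background mixture combine into one image can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `compose_scene`, the array shapes, and the use of a softmax to normalize the background mixing weights are all assumptions made for illustration, and the sketch blends mean colors rather than evaluating the full mixture likelihood used in the ELBO.

```python
import numpy as np


def compose_scene(fg_rgb, fg_alpha, bg_rgb, bg_logits):
    """Blend a foreground canvas with a K-component background mixture.

    All names and shapes here are hypothetical, chosen for illustration.
    fg_rgb:    (H, W, 3) rendered foreground colors
    fg_alpha:  (H, W, 1) foreground mixing weight in [0, 1]
    bg_rgb:    (K, H, W, 3) color map of each background component
    bg_logits: (K, H, W, 1) unnormalized per-pixel mixing scores
    """
    # Assumption: a softmax over the K components stands in for the model's
    # mixing-weight decoder, so the weights sum to 1 at every pixel.
    shifted = bg_logits - bg_logits.max(axis=0, keepdims=True)
    pi = np.exp(shifted)
    pi /= pi.sum(axis=0, keepdims=True)

    # Per-pixel weighted sum of the K background color maps.
    background = (pi * bg_rgb).sum(axis=0)

    # Foreground wins where fg_alpha is high; background fills the rest.
    image = fg_alpha * fg_rgb + (1.0 - fg_alpha) * background
    return image, pi
```

Where `fg_alpha` is 1 the output equals the foreground canvas, and where it is 0 the output is the weighted sum of the background components, which is the sense in which the two modules share a single image model.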

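Experimental Results

Research Questions

RQ1: Can SPACE provide explicit object-centric foreground representations while also decomposing complex backgrounds into components?
RQ2: Does parallel foreground processing improve scalability and speed without sacrificing foreground detection quality?
RQ3: How does SPACE compare with SPAIR, IODINE, and GENESIS on the Atari and 3D-Room datasets in convergence speed, runtime speed, and bounding-box quality?

Key Findings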

Model	Dataset	Avg. Precision (IoU=0.5)	Avg. Precision (IoU 0.5:0.95)	Object Count Error Rate
SPACE (16×16)	3D-Room Large	0.8927 ± 0.0027	0.4445 ± 0.0075	0.0446 ± 0.0026
SPAIR (16×16)	3D-Room Large	0.9072 ± 0.0003	0.4364 ± 0.0179	0.0360 ± 0.0072
SPACE (8×8)	3D-Room Small	0.9027 ± 0.0009	0.5069 ± 0.0030	0.0397 ± 0.0026
SPAIR (8×8)	3D-Room Small	0.9081 ± 0.0004	0.5068 ± 0.0081	0.0209 ± 0.0039
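
The average precision columns depend on an intersection-over-union (IoU) threshold: a predicted box counts as a correct detection only if its IoU with a ground-truth box reaches the threshold, and "IoU 0.5:0.95" conventionally means averaging over a range of thresholds from 0.5 to 0.95. The following is a minimal sketch of the matching criterion only, not the evaluation code behind the table; the function names and the (x_min, y_min, x_max, y_max) box format are assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred, truth, threshold=0.5):
    """True if the predicted box overlaps the ground truth enough to count."""
    return box_iou(pred, truth) >= threshold
```

For example, the boxes (0, 0, 2, 2) and (1, 0, 3, 2) overlap in an area of 2 out of a union of 6, giving an IoU of 1/3, so the prediction would not count as a match at the 0.5 threshold.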

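SPACE matches SPAIR in bounding-box quality while delivering order-of-magnitude improvements in gradient-step latency and training convergence speed.
Because it processes the foreground in parallel, SPACE scales to a large number of foreground objects without significant performance degradation.
SPACE provides explicit, disentangled foreground objects with per-object attributes (position, scale) and also decomposes the background into components; in qualitative analysis on 3D-Room and Atari it outperforms the baselines.
Quantitatively, on 3D-Room Large SPACE is comparable to SPAIR in average precision (0.8927 vs. 0.9072 at IoU=0.5; 0.4445 vs. 0.4364 over IoU 0.5:0.95), though its object count error rate is slightly higher (0.0446 vs. 0.0360); it also converges faster and renders in parallel.
Background: SPACE decomposes the background into multiple components, which models complex morphology better than approaches that treat the background as a single blob.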

This review was generated by AI and checked by a human editor.