QUICK REVIEW

[Paper Review] Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions

Sjoerd van Steenkiste, Michael Chang|arXiv (Cornell University)|Feb 28, 2018

Multimodal Machine Learning Applications | References: 37 | Cited by: 134

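One-Sentence Summary

R-NEM learns object-centric representations from raw visual data without supervision and models interactions between objects through a relational inductive bias, which lets it extrapolate to scenes with a different number of objects and to scenes with occlusion. It extends N-EM with a pairwise interaction module for simulating physical dynamics.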

ABSTRACT

Common-sense physical reasoning is an essential ingredient for any intelligent agent operating in the real-world. For example, it can be used to simulate the environment, or to infer the state of parts of the world that are currently unobserved. In order to match real-world conditions this causal knowledge must be learned without access to supervised data. To address this problem we present a novel method that learns to discover objects and model their physical interactions from raw visual images in a purely unsupervised fashion. It incorporates prior knowledge about the compositional nature of human perception to factor interactions between object-pairs and learn efficiently. On videos of bouncing balls we show the superior modelling capabilities of our method compared to other unsupervised neural approaches that do not incorporate such prior knowledge. We demonstrate its ability to handle occlusion and show that it can extrapolate learned knowledge to scenes with different numbers of objects.

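Motivation and Objectives

Discover objects in visual scenes without supervision by using compositional object representations.
Model interactions between objects to capture physical dynamics.
Generalize robustly to scenes with different numbers of objects and with occlusion.
Demonstrate prediction accuracy for object motion and short-term simulation ability in cluttered environments.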

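Proposed Method

Extend Neural Expectation Maximization (N-EM) with a relational interaction function to form R-NEM.
Represent each object with a latent variable theta_k, and model pixel generation with a neural network f_phi.
Use a generalized EM framework in which the E-step assigns pixels to object components and the M-step updates the object representations.
Add an interaction function Upsilon^R-NEM that computes pairwise effects from learned embeddings and attention coefficients.
Use an encoder-decoder architecture with a denoising / next-step prediction objective to guide the learning of object representations and dynamics.
Train end-to-end with backpropagation through time, optimizing a loss that combines intra-cluster and inter-cluster terms (Equation (3)).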
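
The E-step and the pairwise interaction function in the list above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the Gaussian pixel likelihood with a fixed `sigma`, the single-layer `tanh` maps standing in for learned networks, and the names `e_step`, `pairwise_effects`, `W_emb`, `W_eff`, and `W_att` are all assumptions made for illustration. The sketch also omits the per-object encoding and the recurrent update that would consume the summed effects.

```python
import numpy as np

def e_step(x, psi, sigma=0.25):
    """Soft-assign each pixel to one of K components.

    x:   (D,) observed pixel intensities.
    psi: (K, D) per-component pixel predictions, e.g. f_phi(theta_k).
    Returns gamma of shape (K, D); each column sums to 1.
    Assumes a Gaussian pixel likelihood with fixed sigma and a uniform prior.
    """
    log_lik = -0.5 * ((x[None, :] - psi) / sigma) ** 2
    log_lik -= log_lik.max(axis=0, keepdims=True)  # numerical stability
    lik = np.exp(log_lik)
    return lik / lik.sum(axis=0, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pairwise_effects(theta, W_emb, W_eff, W_att):
    """Sum attention-weighted effects of every other object on each object.

    theta: (K, H) object representations. The weight matrices are
    hypothetical stand-ins for learned networks.
    Returns an array of shape (K, W_eff.shape[1]).
    """
    K = theta.shape[0]
    effects = np.zeros((K, W_eff.shape[1]))
    for k in range(K):
        for i in range(K):
            if i == k:
                continue  # an object does not interact with itself
            pair = np.concatenate([theta[k], theta[i]])
            xi = np.tanh(pair @ W_emb)      # embedding of the (k, i) pair
            effect = np.tanh(xi @ W_eff)    # effect of object i on object k
            alpha = sigmoid(xi @ W_att)     # attention coefficient in (0, 1)
            effects[k] += alpha * effect
    return effects

# Toy E-step: two components, each explaining one bright pixel.
x = np.array([0.0, 1.0, 1.0, 0.0])
psi = np.array([[0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
gamma = e_step(x, psi)

# The same interaction weights apply to any number of objects.
rng = np.random.default_rng(0)
H, E = 4, 8
W_emb = rng.normal(size=(2 * H, E))
W_eff = rng.normal(size=(E, H))
W_att = rng.normal(size=(E, 1))
effects_3 = pairwise_effects(rng.normal(size=(3, H)), W_emb, W_eff, W_att)
effects_6 = pairwise_effects(rng.normal(size=(6, H)), W_emb, W_eff, W_att)
```

Because the weights are shared across all object pairs and the effects are summed, the function accepts any number of objects K without retraining. This is the structural property that the extrapolation experiments to different object counts rely on.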

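Experimental Results

Research Questions

RQ1: Can object-centric representations be learned from raw visual input without supervision?
RQ2: Does the relational mechanism enable learning of inter-object dynamics for predicting future frames?
RQ3: Do the object-centric representations generalize to scenes with more or fewer objects than were seen during training?
RQ4: Is the model robust to occlusion, and does it exhibit object permanence in dynamic scenes?
RQ5: How does attention over objects affect the learning and extrapolation of physical interactions?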

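Key Findings

On bouncing-ball sequences, R-NEM achieves lower prediction loss and lower relational BCE loss than the baselines (RNN, LSTM, RNN-EM).
R-NEM reaches an ARI (Adjusted Rand Index) score of about 0.8, indicating that most balls in the four-ball scenes are modeled by distinct components.
The model extrapolates to 6–8 ball scenes better than competing methods, showing improved generalization to unseen numbers of objects.
R-NEM simulates dynamics accurately, preserving object shape and position across steps, and outperforms the RNN-based methods.
In the occlusion scenario (the curtain experiment), R-NEM maintains object state and predicts reappearance, demonstrating object permanence.
The attention mechanism aligns with collision events, activating the influence of context objects during interactions.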
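
To make the ARI figure above concrete, here is a minimal pure-Python sketch of the Adjusted Rand Index for two pixel-level labelings. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions. Which pixels the paper includes in the score (for example, whether background pixels are excluded) is not stated in this review, so the sketch scores every label it is given.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items.

    1.0 means the partitions agree up to a renaming of the labels;
    values near 0 mean chance-level agreement.
    """
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Count pairs of items that share a cell of the contingency table.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_ij - expected) / (max_index - expected)

# Ground-truth object id per pixel versus the component with the highest
# responsibility per pixel (toy values).
truth = [0, 0, 0, 1, 1, 1]
same_up_to_renaming = [1, 1, 1, 0, 0, 0]
one_pixel_wrong = [0, 0, 1, 1, 1, 1]

ari_perfect = adjusted_rand_index(truth, same_up_to_renaming)
ari_partial = adjusted_rand_index(truth, one_pixel_wrong)
```

The score is invariant to how the components are numbered, which matters here because an unsupervised model has no fixed correspondence between its components and the ground-truth objects.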


This review was generated by AI and reviewed by a human editor.