QUICK REVIEW

[Paper Digest] Visual Interaction Networks

Nicholas Watters, Andrea Tacchetti|arXiv (Cornell University)|Jun 5, 2017

Data Visualization and Analytics | References: 20 | Cited by: 69

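One-Sentence Summary

The Visual Interaction Network (VIN) combines a CNN-based perceptual encoder with an interaction-network-based dynamics predictor. It learns to predict future object states from raw video, which enables long-horizon physical prediction, including for invisible objects.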

ABSTRACT

From just a glance, humans can make rich predictions about the future state of a wide range of physical systems. On the other hand, modern approaches from engineering, robotics, and graphics are often restricted to narrow domains and require direct measurements of the underlying states. We introduce the Visual Interaction Network, a general-purpose model for learning the dynamics of a physical system from raw visual observations. Our model consists of a perceptual front-end based on convolutional neural networks and a dynamics predictor based on interaction networks. Through joint training, the perceptual front-end learns to parse a dynamic visual scene into a set of factored latent object representations. The dynamics predictor learns to roll these states forward in time by computing their interactions and dynamics, producing a predicted physical trajectory of arbitrary length. We found that from just six input video frames the Visual Interaction Network can generate accurate future trajectories of hundreds of time steps on a wide range of physical systems. Our model can also be applied to scenes with invisible objects, inferring their future states from their effects on the visible objects, and can implicitly infer the unknown mass of objects. Our results demonstrate that the perceptual module and the object-based dynamics predictor module can induce factored latent representations that support accurate dynamical predictions. This work opens new opportunities for model-based decision-making and planning from raw sensory observations in complex physical environments.

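Motivation and Objectives

Provide a general-purpose model that predicts future physical states from raw visual observations.
Learn factored latent object representations that support accurate long-horizon dynamics.
Demonstrate robustness to visual noise and partial observability across a range of physical systems.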

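Proposed Method

Use a CNN-based visual encoder to extract a state code, with one slot per object, from each sequence of three consecutive frames.
Use an interaction-network-based dynamics predictor that operates over multiple temporal offsets to predict the state code at the next step.
Decode state codes into object positions and velocities, which serve as the training targets.
Train end to end, with a loss equal to the prediction loss on future steps plus an auxiliary encoder loss.
Evaluate rollouts over long horizons and compare against baselines, including state-to-state models and vision-only models.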
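
The dynamics-prediction and decoding steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer sizes, the random untrained weights, the offsets (1, 2, 4), the four-code history window, and every function name (`interaction_core`, `predict_next`, `rollout`, `decode`) are chosen for the example. The CNN encoder and the training loop are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
OFFSETS = (1, 2, 4)  # assumed temporal offsets; needs a history of >= 4 codes


def mlp_params(sizes):
    """Random (untrained) weights for a small MLP; sizes are illustrative."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(params, x):
    """Apply the MLP along the last axis, with ReLU on hidden layers only."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x


def init_params(dim, hidden=32):
    """Build one interaction core per offset, an aggregator, and a decoder."""
    def core():
        return {"self": mlp_params([dim, hidden, dim]),
                "rel": mlp_params([2 * dim, hidden, dim]),
                "out": mlp_params([2 * dim, hidden, dim])}
    return {"cores": {k: core() for k in OFFSETS},
            "agg": mlp_params([dim * len(OFFSETS), hidden, dim]),
            "dec": mlp_params([dim, 4])}  # linear readout: (x, y, vx, vy)


def interaction_core(p, slots):
    """One interaction-network-style update of a state code.

    slots has shape (num_objects, dim), one row per object slot.
    """
    n = slots.shape[0]
    self_effect = mlp(p["self"], slots)
    recv = np.repeat(slots[:, None, :], n, axis=1)  # recv[i, j] = slots[i]
    send = np.repeat(slots[None, :, :], n, axis=0)  # send[i, j] = slots[j]
    pair = mlp(p["rel"], np.concatenate([recv, send], axis=-1))
    not_self = 1.0 - np.eye(n)[:, :, None]          # drop i == j pairs
    rel_effect = (pair * not_self).sum(axis=1)      # sum over senders j
    combined = np.concatenate([slots, self_effect + rel_effect], axis=-1)
    return mlp(p["out"], combined)


def predict_next(params, history):
    """Combine one candidate per temporal offset into the next state code."""
    candidates = [interaction_core(params["cores"][k], history[-k])
                  for k in OFFSETS]
    return mlp(params["agg"], np.concatenate(candidates, axis=-1))


def rollout(params, history, steps):
    """Feed predictions back in as inputs to roll the state code forward."""
    history = list(history)
    preds = []
    for _ in range(steps):
        nxt = predict_next(params, history)
        preds.append(nxt)
        history = history[1:] + [nxt]  # slide the window by one step
    return np.stack(preds)


def decode(params, codes):
    """Read out position and velocity (4 numbers) from each object slot."""
    return mlp(params["dec"], codes)


params = init_params(dim=8)
history = [rng.normal(size=(3, 8)) for _ in range(4)]  # 4 codes, 3 objects
states = decode(params, rollout(params, history, steps=5))
print(states.shape)  # (5, 3, 4)
```

Because the relational term sums over senders and every MLP is applied per slot, permuting the object slots in the input permutes the prediction in the same way. In this sketch, that property is what makes the state code object-factored.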

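Experimental Results

Research Questions

RQ1: Can the perceptual front-end and the object-based dynamics predictor jointly learn to infer states from video and predict future trajectories?
RQ2: How well does VIN scale as the number of objects grows, and how does it handle partially observable (invisible) objects?
RQ3: Do temporal-offset aggregation and relational reasoning improve long-horizon physical prediction over the baselines?
RQ4: Is the model robust to noise from the visual encoder, and can it infer hidden quantities such as unobserved mass?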

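Key Findings

On every dataset, VIN outperforms the baselines in terms of inverse normalized loss, in both 3-object and 6-object scenes.
VIN produces accurate long-horizon rollouts: the Euclidean prediction error stays low over 50 steps on all datasets.
VIN can infer the positions of invisible objects from their effects on visible objects (e.g., hidden springs), to within roughly 4% of the frame width on the initial rollout steps.
In the drift (no-interaction) setting, VIN performs on par with an ablation that lacks the relation network, which highlights that relational reasoning matters when interactions are present.
Training on perceptual (noisy) inputs appears to make long rollouts more robust than those of a pure state-to-state model.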
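
The rollout-error figures above can be read as a per-step Euclidean distance between predicted and true positions, expressed relative to the frame width. Below is a minimal sketch, assuming positions in pixels and an average over objects. The function name `rollout_error`, the toy trajectories, and the 100-pixel frame width are hypothetical and are not taken from the paper.

```python
import numpy as np


def rollout_error(pred_xy, true_xy, frame_width):
    """Mean Euclidean error per rollout step, as a fraction of frame width.

    pred_xy and true_xy have shape (steps, num_objects, 2), in pixels.
    """
    dist = np.linalg.norm(pred_xy - true_xy, axis=-1)  # (steps, num_objects)
    return dist.mean(axis=1) / frame_width


# Toy data: the prediction drifts by (3, 4) pixels per step, i.e. 5 px/step.
steps, num_objects = 4, 2
true_xy = np.zeros((steps, num_objects, 2))
drift = np.arange(steps)[:, None, None] * np.array([3.0, 4.0])
pred_xy = true_xy + drift

err = rollout_error(pred_xy, true_xy, frame_width=100.0)
print(err)  # per-step errors: 0.0, 0.05, 0.10, 0.15 of the frame width
```

Under this reading, an error "within 4% of the frame width" corresponds to a value of at most 0.04 from this function at the steps in question.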

This digest was generated by AI and reviewed by a human editor.