QUICK REVIEW

[Paper Review] Visual Interaction Networks

Nicholas Watters, Andrea Tacchetti|arXiv (Cornell University)|Jun 5, 2017

Data Visualization and Analytics|References 20|Citations 69

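One-line summary

The Visual Interaction Network (VIN) combines a CNN-based perceptual encoder with an Interaction Network-based dynamics predictor. It learns to predict future object states from raw video, which enables long-term physical prediction, including in scenes that contain invisible objects.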

ABSTRACT

From just a glance, humans can make rich predictions about the future state of a wide range of physical systems. On the other hand, modern approaches from engineering, robotics, and graphics are often restricted to narrow domains and require direct measurements of the underlying states. We introduce the Visual Interaction Network, a general-purpose model for learning the dynamics of a physical system from raw visual observations. Our model consists of a perceptual front-end based on convolutional neural networks and a dynamics predictor based on interaction networks. Through joint training, the perceptual front-end learns to parse a dynamic visual scene into a set of factored latent object representations. The dynamics predictor learns to roll these states forward in time by computing their interactions and dynamics, producing a predicted physical trajectory of arbitrary length. We found that from just six input video frames the Visual Interaction Network can generate accurate future trajectories of hundreds of time steps on a wide range of physical systems. Our model can also be applied to scenes with invisible objects, inferring their future states from their effects on the visible objects, and can implicitly infer the unknown mass of objects. Our results demonstrate that the perceptual module and the object-based dynamics predictor module can induce factored latent representations that support accurate dynamical predictions. This work opens new opportunities for model-based decision-making and planning from raw sensory observations in complex physical environments.

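Motivation and objectives

Provide a general-purpose model that predicts future physical states from raw visual observations.
Learn factored latent object representations that support accurate long-term dynamics.
Demonstrate robustness to visual noise and partial observability across a variety of physical systems.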

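Proposed method

Use a CNN-based visual encoder to extract a state code, with an entry for each object, from triplets of three frames.
Predict the next-step state code with an Interaction Network-based dynamics predictor that uses multiple temporal offsets.
Decode state codes into object positions and velocities, which are compared against the training targets.
Train end-to-end with a loss that combines the prediction loss on future steps with an auxiliary encoder loss.
Evaluate long-term rollouts and compare against baselines, including state-to-state models and visual-only models.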
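The prediction and rollout steps above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the weights are random and untrained, the encoder is replaced by random state codes, and the slot count, code size, offsets `(1, 2, 4)`, and all function names (`dense`, `make_core`, `predict_next`, `rollout`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N_OBJ, DIM = 3, 8        # assumed: 3 object slots, 8-dim code per slot
OFFSETS = (1, 2, 4)      # assumed temporal offsets, in steps back from now


def dense(in_dim, out_dim):
    """Return a random, untrained tanh layer as a closure."""
    w = rng.normal(scale=0.1, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return lambda x: np.tanh(x @ w + b)


def make_core():
    """One interaction-style core: self-dynamics plus summed pairwise effects."""
    self_net = dense(DIM, DIM)
    pair_net = dense(2 * DIM, DIM)
    out_net = dense(2 * DIM, DIM)

    def core(state):                                  # state: (n, DIM)
        n = state.shape[0]
        receivers = np.repeat(state, n, axis=0)       # row i*n+j holds slot i
        senders = np.tile(state, (n, 1))              # row i*n+j holds slot j
        effects = pair_net(np.concatenate([receivers, senders], axis=1))
        effects = effects.reshape(n, n, -1)           # [receiver, sender, DIM]
        mask = 1.0 - np.eye(n)[:, :, None]            # drop self-pairs
        total = (effects * mask).sum(axis=1) + self_net(state)
        return out_net(np.concatenate([state, total], axis=1))

    return core


cores = [make_core() for _ in OFFSETS]
aggregator = dense(len(OFFSETS) * DIM, DIM)


def predict_next(history):
    """Predict the next state code from past codes (newest last)."""
    candidates = [core(history[-off]) for core, off in zip(cores, OFFSETS)]
    return aggregator(np.concatenate(candidates, axis=1))


def rollout(history, steps):
    """Feed each prediction back in to roll the state codes forward."""
    history = list(history)
    for _ in range(steps):
        history.append(predict_next(history))
    return np.stack(history[-steps:])


# Stand-in for encoder output: as many past codes as the largest offset.
initial_codes = [rng.normal(size=(N_OBJ, DIM)) for _ in range(max(OFFSETS))]
future = rollout(initial_codes, steps=10)
print(future.shape)  # (10, 3, 8)
```

Because every network here is applied per slot or per ordered pair of slots, permuting the object slots in the input permutes the prediction in the same way, which is the property that makes this family of models object-centric. In a trained system the rolled-out codes would additionally be decoded into positions and velocities for the loss.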

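Experimental results

Research questions

RQ1 Can a perceptual front-end and an object-centric dynamics predictor jointly learn to estimate states from video and predict future trajectories?
RQ2 How well does VIN scale to a larger number of objects and to objects that are not visible?
RQ3 Do temporal-offset aggregation and relational reasoning improve long-term physical prediction relative to the baselines?
RQ4 Is the model robust to noise in the visual encoder, and can it infer hidden quantities such as unobserved mass?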

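Key findings

VIN outperforms the baselines on inverse normalized loss across all datasets, in both 3-object and 6-object scenes.
VIN produces accurate long-term rollouts, keeping the Euclidean prediction error low over 50 steps across the datasets.
VIN can infer the positions of invisible objects from their effects on visible objects (e.g., hidden springs), to within about 4% of the frame width in the early rollout steps.
In the drift (no-interaction) scenario, VIN performs on par with the ablated version that lacks the relation network, which highlights the role of relational reasoning when interactions are present.
Training on perceptual, noisy inputs appears to improve the robustness of long-term rollouts compared with a pure state-to-state model.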
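To make the "fraction of frame width" figures above concrete, the following is a minimal sketch of one way such a per-step Euclidean error could be computed. The function name `euclidean_error_fraction`, the array layout, the 32-pixel frame width, and the synthetic trajectories are all assumptions for illustration and are not taken from the paper's evaluation code.

```python
import numpy as np


def euclidean_error_fraction(pred_xy, true_xy, frame_width):
    """Mean Euclidean position error per rollout step, as a fraction of frame width.

    pred_xy, true_xy: arrays of shape (steps, n_objects, 2), in pixels.
    """
    dist = np.linalg.norm(pred_xy - true_xy, axis=-1)   # (steps, n_objects)
    return dist.mean(axis=1) / frame_width


# Hypothetical toy data: 50 steps, 3 objects, 32-pixel-wide frames.
rng = np.random.default_rng(0)
steps, n_obj, width = 50, 3, 32.0
true_xy = rng.uniform(0, width, size=(steps, n_obj, 2))

# Synthetic prediction whose x-offset grows linearly from 0 to 2 pixels.
drift = np.linspace(0.0, 2.0, steps)[:, None, None]
pred_xy = true_xy + drift * np.array([1.0, 0.0])

err = euclidean_error_fraction(pred_xy, true_xy, width)
print(err[0], err[-1])
```

In this toy setup the error starts at zero and ends near 2/32, or about 6% of the frame width. A reported figure of about 4% would correspond to a mean offset of roughly 1.3 pixels at that frame size.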

This review was generated by AI and checked by a human editor.