QUICK REVIEW

[Paper Review] Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

Mingyu Ding, Zhenfang Chen|arXiv (Cornell University)|Oct 28, 2021

Multimodal Machine Learning Applications | References: 83 | Citations: 26

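One-line summary

VRDP jointly learns object trajectories, language-grounded concepts, and differentiable physics to reason about dynamics. It achieves state-of-the-art results on CLEVRER and demonstrates data efficiency and generalization.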

ABSTRACT

In this work, we propose a unified framework, called Visual Reasoning with Differentiable Physics (VRDP), that can jointly learn visual concepts and infer physics models of objects and their interactions from videos and language. This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine. The visual perception module parses each video frame into object-centric trajectories and represents them as latent scene representations. The concept learner grounds visual concepts (e.g., color, shape, and material) from these object-centric representations based on the language, thus providing prior knowledge for the physics engine. The differentiable physics model, implemented as an impulse-based differentiable rigid-body simulator, performs differentiable physical simulation based on the grounded concepts to infer physical properties, such as mass, restitution, and velocity, by fitting the simulated trajectories into the video observations. Consequently, these learned concepts and physical models can explain what we have seen and imagine what is about to happen in future and counterfactual scenarios. Integrating differentiable physics into the dynamic reasoning framework offers several appealing benefits. More accurate dynamics prediction in learned physics models enables state-of-the-art performance on both synthetic and real-world benchmarks while still maintaining high transparency and interpretability; most notably, VRDP improves the accuracy of predictive and counterfactual questions by 4.5% and 11.5% compared to its best counterpart. VRDP is also highly data-efficient: physical parameters can be optimized from very few videos, and even a single video can be sufficient. Finally, with all physical parameters inferred, VRDP can quickly learn new concepts from a few examples.

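Motivation and objectives

Parse videos into object-centric trajectories and ground visual concepts from language.
Integrate a differentiable physics engine to infer physical properties from video data.
Use the learned physics to perform predictive and counterfactual reasoning in transparent, interpretable steps.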

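Proposed method

The visual perception module uses Faster R-CNN to extract object proposals and build trajectories.
The concept learner grounds object attributes and events through language-driven embeddings and nearest-neighbor quantization.
The differentiable impulse-based rigid-body physics engine estimates parameters such as mass, restitution, and velocity by fitting simulated trajectories to the observations.
Physics-based simulation generates future trajectories and counterfactual scenarios for reasoning.
The symbolic program executor performs differentiable, step-by-step reasoning over the grounded concepts and the simulated data.
Training optimizes program parsing, physical parameters, and QA objectives, each with its own loss.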
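
The parameter-fitting step above can be sketched in miniature. The following Python snippet is only an illustration under strong assumptions, not the paper's impulse-based simulator: it models a single object in 1D constant-velocity motion, and the names `simulate` and `fit_trajectory` are hypothetical. The gradients are derived by hand here, whereas a differentiable physics engine would typically obtain them through automatic differentiation.

```python
def simulate(x0, v, steps, dt):
    """Roll out constant-velocity motion: x_t = x0 + v * t * dt."""
    return [x0 + v * t * dt for t in range(steps)]


def fit_trajectory(observed, dt, lr=0.05, iters=2000):
    """Estimate (x0, v) by gradient descent on the mean squared trajectory error."""
    x0, v = 0.0, 0.0  # arbitrary initial guess (assumption)
    n = len(observed)
    for _ in range(iters):
        simulated = simulate(x0, v, n, dt)
        residuals = [s - o for s, o in zip(simulated, observed)]
        # Hand-derived gradients of the mean squared error w.r.t. x0 and v.
        grad_x0 = 2.0 * sum(residuals) / n
        grad_v = 2.0 * sum(r * t * dt for t, r in enumerate(residuals)) / n
        x0 -= lr * grad_x0
        v -= lr * grad_v
    return x0, v


# Synthetic "observation" standing in for a trajectory parsed from video.
observed = simulate(x0=1.0, v=2.5, steps=20, dt=0.1)
x0_hat, v_hat = fit_trajectory(observed, dt=0.1)
```

With a noise-free observation such as this one, the recovered `x0_hat` and `v_hat` should closely match the values used to generate the trajectory. Properties such as mass and restitution would need a simulator that includes collisions.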

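Experimental results

Research questions

RQ1: Can an explicit differentiable physics model built on learned concepts improve dynamic visual reasoning from video and language?
RQ2: Do physics-based representations improve accuracy, data efficiency, and generalization on the CLEVRER and Real-Billiard datasets?
RQ3: How does concept grounding from language interact with perception and physics to support predictive and counterfactual reasoning?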

Key findings

Methods	Overall	Predictive	Counterfactual	Descriptive	Explanatory	per task	per ques.
VRDP (ours)	82.9	86.9	91.7	83.8	89.9	75.7	89.8
VRDP (ours) †	86.6	89.4	94.5	89.2	92.5	80.7	91.5
VRDP (ours) †‡	90.3	92.0	95.7	91.4	94.8	84.3	93.4

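VRDP achieves state-of-the-art performance on predictive and counterfactual questions in CLEVRER, improving accuracy by 4.5% and 11.5% over its best counterpart.
The model is data-efficient: it needs less data to reach competitive or better accuracy, and physical parameters can be optimized from very few videos, in some cases a single one.
Grounded physical parameters enable transparent, interpretable reasoning with clear physical meaning.
VRDP generalizes to new concepts from few-shot data (e.g., learning "heavy" from 25 videos).
Ablations show that curriculum optimization and re-optimization improve accuracy on predictive and counterfactual QA.
On Real-Billiard, VRDP demonstrates effective dynamics prediction in real-world scenarios.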
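
To make the notion of a counterfactual rollout concrete, here is a small Python sketch. It is an assumption-laden toy, not the VRDP engine: motion is restricted to 1D, the function `rollout` and the dictionary keys (`name`, `x`, `v`, `m`, `r`) are hypothetical, and the physical parameters are given rather than inferred. It resolves contacts with an impulse that depends on mass and restitution, then re-runs the same scene with one object removed.

```python
def rollout(balls, restitution, steps, dt):
    """Simulate balls on a line; return the final state and the number of collisions."""
    state = [dict(b) for b in balls]  # copy so the input scene is not modified
    collisions = 0
    for _ in range(steps):
        for b in state:
            b["x"] += b["v"] * dt
        for i in range(len(state)):
            for j in range(i + 1, len(state)):
                a, b = state[i], state[j]
                normal = 1.0 if b["x"] >= a["x"] else -1.0  # direction from a to b
                gap = abs(b["x"] - a["x"]) - (a["r"] + b["r"])
                closing = (a["v"] - b["v"]) * normal  # > 0 when approaching
                if gap <= 0.0 and closing > 0.0:
                    # Impulse magnitude for a 1D contact with the given restitution.
                    impulse = (1.0 + restitution) * closing / (1.0 / a["m"] + 1.0 / b["m"])
                    a["v"] -= impulse * normal / a["m"]
                    b["v"] += impulse * normal / b["m"]
                    collisions += 1
    return state, collisions


scene = [
    {"name": "cue", "x": 0.0, "v": 1.0, "m": 1.0, "r": 0.1},
    {"name": "target", "x": 1.0, "v": 0.0, "m": 1.0, "r": 0.1},
]

# Factual rollout: the cue ball hits the target.
factual, factual_hits = rollout(scene, restitution=1.0, steps=200, dt=0.01)

# Counterfactual rollout: "what if the target were removed?"
without_target = [b for b in scene if b["name"] != "target"]
counterfactual, counterfactual_hits = rollout(without_target, restitution=1.0, steps=200, dt=0.01)
```

Comparing `factual_hits` with `counterfactual_hits` answers a question of the form "would the collision still happen without the target?", which is the kind of query a counterfactual QA item poses.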

This review was generated by AI and checked by a human editor.