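[Paper Summary] Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language
VRDP jointly learns object trajectories, language-grounded concepts, and differentiable physics models to reason about dynamics. It achieves state-of-the-art results on CLEVRER and demonstrates data efficiency and generalization.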
In this work, we propose a unified framework, called Visual Reasoning with Differentiable Physics (VRDP), that can jointly learn visual concepts and infer physics models of objects and their interactions from videos and language. This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine. The visual perception module parses each video frame into object-centric trajectories and represents them as latent scene representations. The concept learner grounds visual concepts (e.g., color, shape, and material) from these object-centric representations based on the language, thus providing prior knowledge for the physics engine. The differentiable physics model, implemented as an impulse-based differentiable rigid-body simulator, performs differentiable physical simulation based on the grounded concepts to infer physical properties, such as mass, restitution, and velocity, by fitting the simulated trajectories into the video observations. Consequently, these learned concepts and physical models can explain what we have seen and imagine what is about to happen in future and counterfactual scenarios. Integrating differentiable physics into the dynamic reasoning framework offers several appealing benefits. More accurate dynamics prediction in learned physics models enables state-of-the-art performance on both synthetic and real-world benchmarks while still maintaining high transparency and interpretability; most notably, VRDP improves the accuracy of predictive and counterfactual questions by 4.5% and 11.5% compared to its best counterpart. VRDP is also highly data-efficient: physical parameters can be optimized from very few videos, and even a single video can be sufficient. Finally, with all physical parameters inferred, VRDP can quickly learn new concepts from a few examples.
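To make the idea of inferring physical properties by fitting simulation to observation more concrete, here is a minimal sketch. It is not the paper's implementation: it assumes a one-dimensional collision between two bodies with known masses, and it recovers only the restitution coefficient. The function names `collide_1d` and `fit_restitution`, the step size, and the hand-derived gradient are all illustrative assumptions; a full differentiable engine would typically obtain gradients through automatic differentiation instead.

```python
def collide_1d(m1, m2, v1, v2, e):
    """Impulse-based response for a 1D collision (illustrative sketch).

    m1, m2: masses; v1, v2: pre-collision velocities;
    e: coefficient of restitution. Returns post-collision velocities.
    """
    # Impulse magnitude exchanged along the contact direction.
    j = (1.0 + e) * (v1 - v2) * m1 * m2 / (m1 + m2)
    return v1 - j / m1, v2 + j / m2


def fit_restitution(m1, m2, v1, v2, observed, e_init=0.5, lr=0.05, steps=200):
    """Recover e by gradient descent on the squared velocity error.

    `observed` holds the two post-collision velocities taken from data.
    The learning rate is an assumption that suits the example below;
    other masses or velocities may need a different value.
    """
    e = e_init
    rel = (v1 - v2) / (m1 + m2)
    # Hand-derived sensitivities of the simulated velocities w.r.t. e.
    dv1_de, dv2_de = -m2 * rel, m1 * rel
    for _ in range(steps):
        s1, s2 = collide_1d(m1, m2, v1, v2, e)
        r1, r2 = s1 - observed[0], s2 - observed[1]
        # Gradient of r1**2 + r2**2 with respect to e.
        grad = 2.0 * (r1 * dv1_de + r2 * dv2_de)
        e -= lr * grad
    return e


# Synthetic "observation" generated with a known restitution of 0.8.
observed = collide_1d(1.0, 2.0, 3.0, 0.0, 0.8)
estimated_e = fit_restitution(1.0, 2.0, 3.0, 0.0, observed)
```

In this toy setting the loss is quadratic in the restitution coefficient, so plain gradient descent converges quickly. The same pattern (simulate, compare with the observed trajectory, update the parameters) is what the abstract describes at the scale of full rigid-body scenes.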
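Motivation and Goals
- Learn to parse videos into object-centric trajectories and to ground visual concepts from language.
- Integrate a differentiable physics engine to infer physical properties from video data.
- Use the learned physics for predictive and counterfactual reasoning, with transparent and interpretable steps.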
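Proposed Method
- The visual perception module uses Faster R-CNN to extract object proposals and build trajectories.
- The concept learner grounds object attributes and events through language-driven embeddings and nearest-neighbor quantization.
- An impulse-based differentiable rigid-body physics engine estimates parameters such as mass, restitution, and velocity by fitting simulated trajectories to the observations.
- Physics-based simulation generates future trajectories and counterfactual scenarios for reasoning.
- A symbolic program executor performs differentiable, step-by-step reasoning over the grounded concepts and simulated data.
- Training optimizes program parsing, physical parameters, and the question-answering objective with corresponding loss functions.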
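The nearest-neighbor quantization step in the concept learner can be sketched as follows. This is only an illustration under stated assumptions: the three-dimensional embeddings, the `COLOR_CONCEPTS` table, and the use of cosine similarity are hypothetical choices made for the example, not details confirmed by the summary above. In the actual system the embeddings would be learned from video and language.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def quantize(obj_embedding, concept_embeddings):
    """Return the name of the concept closest to the object embedding."""
    return max(
        concept_embeddings,
        key=lambda name: cosine(obj_embedding, concept_embeddings[name]),
    )


# Hypothetical concept embeddings for the "color" attribute.
COLOR_CONCEPTS = {
    "red": [1.0, 0.1, 0.0],
    "green": [0.0, 1.0, 0.1],
    "blue": [0.1, 0.0, 1.0],
}

# Hypothetical embedding of one detected object.
predicted_color = quantize([0.9, 0.2, 0.1], COLOR_CONCEPTS)
```

Snapping each object to a discrete concept in this way is what lets a downstream symbolic executor filter objects by attribute and lets the physics engine use the grounded concepts as prior knowledge.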
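Experimental Results
Research Questions
- RQ1: Can an explicit differentiable physics model, grounded in learned concepts, improve dynamic visual reasoning from video and language?
- RQ2: Do physics-based representations improve accuracy, data efficiency, and generalization on the CLEVRER and Real-Billiard datasets?
- RQ3: How does concept grounding from language interact with perception and physics to support predictive and counterfactual reasoning?
Key Findings
- VRDP achieves state-of-the-art performance on the predictive and counterfactual questions of CLEVRER.
- The model is data-efficient: it reaches competitive or higher accuracy with less training data.
- The grounded physical parameters make the reasoning transparent and interpretable, with explicit physical meaning.
- VRDP generalizes quickly to new concepts from little data (e.g., learning "heavier" from only 25 videos).
- Ablation studies show that curriculum optimization and re-optimization improve accuracy on predictive and counterfactual questions.
- On Real-Billiard, VRDP demonstrates effective dynamics prediction in real-world scenes.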
This summary was generated by AI and reviewed by a human editor.