QUICK REVIEW

[Paper Review] Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

Mingyu Ding, Zhenfang Chen | arXiv (Cornell University) | October 28, 2021

Multimodal Machine Learning Applications | References: 83 | Citations: 26

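One-Line Summary

VRDP jointly learns object trajectories, language-grounded concepts, and differentiable physics to reason about dynamics. It achieves state-of-the-art performance on CLEVRER and demonstrates data efficiency and generalization.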

ABSTRACT

In this work, we propose a unified framework, called Visual Reasoning with Differentiable Physics (VRDP), that can jointly learn visual concepts and infer physics models of objects and their interactions from videos and language. This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine. The visual perception module parses each video frame into object-centric trajectories and represents them as latent scene representations. The concept learner grounds visual concepts (e.g., color, shape, and material) from these object-centric representations based on the language, thus providing prior knowledge for the physics engine. The differentiable physics model, implemented as an impulse-based differentiable rigid-body simulator, performs differentiable physical simulation based on the grounded concepts to infer physical properties, such as mass, restitution, and velocity, by fitting the simulated trajectories into the video observations. Consequently, these learned concepts and physical models can explain what we have seen and imagine what is about to happen in future and counterfactual scenarios. Integrating differentiable physics into the dynamic reasoning framework offers several appealing benefits. More accurate dynamics prediction in learned physics models enables state-of-the-art performance on both synthetic and real-world benchmarks while still maintaining high transparency and interpretability; most notably, VRDP improves the accuracy of predictive and counterfactual questions by 4.5% and 11.5% compared to its best counterpart. VRDP is also highly data-efficient: physical parameters can be optimized from very few videos, and even a single video can be sufficient. Finally, with all physical parameters inferred, VRDP can quickly learn new concepts from a few examples.

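Motivation and Objectives

Learn to parse videos into object-centric trajectories and to ground visual concepts in language.
Integrate a differentiable physics engine to infer physical properties from video data.
Use the learned physics to perform predictive and counterfactual reasoning in transparent, interpretable steps.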

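Proposed Method

A visual perception module based on Faster R-CNN extracts object proposals and builds trajectories.
The concept learner grounds object attributes and events through language-based embeddings and nearest-neighbor quantization.
A differentiable impulse-based rigid-body physics engine estimates mass, restitution, velocity, and other parameters by fitting simulated trajectories to the observations.
Physics-based simulation generates future trajectories and counterfactual scenarios for reasoning.
A symbolic program executor performs differentiable, step-by-step reasoning over the grounded concepts and simulated data.
Training optimizes program parsing, the physical parameters, and the QA targets with appropriate losses.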
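The parameter-fitting step can be illustrated with a deliberately tiny example. The sketch below is not the authors' simulator: it assumes two balls on a 1D line, known masses, and a single collision, and it recovers only the restitution coefficient by gradient descent on the gap between simulated and observed positions. All function names (`collide`, `collide_grad_e`, `rollout`, `fit_restitution`) are hypothetical, and the gradient is derived by hand here, whereas a real system would more likely rely on an automatic-differentiation framework.

```python
def collide(v1, v2, m1, m2, e):
    """Impulse-based 1D collision; returns the post-collision velocities."""
    v_rel = v1 - v2
    j = -(1.0 + e) * v_rel / (1.0 / m1 + 1.0 / m2)  # impulse magnitude
    return v1 + j / m1, v2 - j / m2


def collide_grad_e(v1, v2, m1, m2):
    """Hand-derived d(v1')/de and d(v2')/de for collide()."""
    v_rel = v1 - v2
    dj = -v_rel / (1.0 / m1 + 1.0 / m2)  # dj/de
    return dj / m1, -dj / m2


def rollout(x1, x2, v1, v2, steps, dt):
    """Constant-velocity positions of both balls after the collision."""
    return [(x1 + v1 * k * dt, x2 + v2 * k * dt) for k in range(1, steps + 1)]


def fit_restitution(observed, x1, x2, v1, v2, m1, m2, dt, lr=0.5, iters=200):
    """Gradient descent on mean squared position error with respect to e."""
    e = 0.5  # initial guess (assumption)
    g1, g2 = collide_grad_e(v1, v2, m1, m2)
    for _ in range(iters):
        u1, u2 = collide(v1, v2, m1, m2, e)
        sim = rollout(x1, x2, u1, u2, len(observed), dt)
        grad = 0.0
        for k, ((s1, s2), (o1, o2)) in enumerate(zip(sim, observed), start=1):
            # Chain rule: d(position_k)/de = d(velocity)/de * k * dt
            grad += 2.0 * (s1 - o1) * g1 * k * dt
            grad += 2.0 * (s2 - o2) * g2 * k * dt
        e -= lr * grad / len(observed)
    return e


if __name__ == "__main__":
    # Synthetic "observation" generated with a ground-truth restitution of 0.8.
    true_v = collide(2.0, 0.0, 1.0, 1.0, 0.8)
    obs = rollout(0.0, 1.0, true_v[0], true_v[1], 10, 0.1)
    print(fit_restitution(obs, 0.0, 1.0, 2.0, 0.0, 1.0, 1.0, 0.1))
```

Because the post-collision velocities are linear in the restitution coefficient, the loss in this toy setting is quadratic and plain gradient descent is enough. Fitting mass, initial velocity, and several interacting objects at once is a harder problem that this sketch does not attempt.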

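Experimental Results

Research Questions

RQ1: Can an explicit differentiable physics model grounded in learned concepts improve dynamic visual reasoning from video and language?
RQ2: Do physics-based representations improve accuracy, data efficiency, and generalization on the CLEVRER and Real-Billiard datasets?
RQ3: How does grounding concepts in language interact with perception and physics to support predictive and counterfactual reasoning?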

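Key Findings

VRDP achieves state-of-the-art performance on CLEVRER for predictive and counterfactual questions; the abstract reports gains of 4.5% and 11.5% on these question types over the best competing method.
It is highly data-efficient, reaching competitive or superior accuracy with less data.
The grounded physical parameters enable transparent, interpretable reasoning with explicit physical meaning.
VRDP generalizes to new concepts from little data (e.g., learning "heavier" from 25 videos).
Ablation studies confirm that curriculum optimization and re-optimization improve predictive and counterfactual QA accuracy.
On Real-Billiard, VRDP demonstrates effective dynamics prediction in real-world scenarios.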
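To make the counterfactual question type concrete, here is a minimal sketch of how such a question could be answered once the physical parameters are known: remove one object, re-run the simulation, and compare the collision events. It is an illustration under strong assumptions (1D motion, equal masses, fixed radius, hypothetical function and object names), not the pipeline evaluated in the paper.

```python
def simulate(balls, steps=200, dt=0.05, radius=0.5, e=1.0):
    """Roll out equal-mass balls on a line; return the set of colliding pairs.

    `balls` maps a name to an (initial position, initial velocity) tuple.
    """
    state = {name: [x, v] for name, (x, v) in balls.items()}
    names = sorted(state)
    events = set()
    for _ in range(steps):
        for s in state.values():
            s[0] += s[1] * dt  # advance positions
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                (xa, va), (xb, vb) = state[a], state[b]
                touching = abs(xa - xb) <= 2 * radius
                approaching = (va - vb) * (xa - xb) < 0
                if touching and approaching:
                    # Equal unit masses: impulse j shifts each velocity by +/- j.
                    j = -(1.0 + e) * (va - vb) / 2.0
                    state[a][1] = va + j
                    state[b][1] = vb - j
                    events.add((a, b))
    return events


def counterfactual(balls, removed, **kwargs):
    """Re-simulate the scene with one object removed."""
    remaining = {n: s for n, s in balls.items() if n != removed}
    return simulate(remaining, **kwargs)


if __name__ == "__main__":
    scene = {"red": (0.0, 2.0), "green": (4.0, 0.0), "blue": (8.0, 0.0)}
    print("factual:", simulate(scene))
    print("without green:", counterfactual(scene, "green"))
```

In this toy scene the red ball strikes green, which then strikes blue; with green removed, red reaches blue directly. A question such as "would red collide with blue if green were removed?" then reduces to a set-membership check on the re-simulated events.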


This review was generated by AI and reviewed by a human editor.