QUICK REVIEW

[Paper Review] Coding the Visual World: From Image to Simulation Using Vision Language Models

Sagi Eppel|arXiv (Cornell University)|Jan 8, 2026
Language and cultural evolution | Cited by 0
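ONE-SENTENCE SUMMARY

The paper uses vision language models to describe real-world images, generate code that simulates the depicted systems, and compare the resulting synthetic images with the originals, revealing strong high-level understanding but limited ability to replicate fine details.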

ABSTRACT

The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations. Through analysis of the models and images generated by the VLMs, we examine their understanding of the systems in images. The results show that leading VLMs (GPT, Gemini) have the ability to understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, the VLMs exhibit limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine high-level, deep visual understanding of images with limited perception of fine details.

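Motivation and Objectives

  • Motivation: understand how VLMs build mental models of the complex systems depicted in images.
  • Goal: describe the system in an image and generate executable simulation code that reproduces the depicted phenomenon.
  • Purpose: assess how well VLMs understand systems across multiple domains, from physical systems to ecosystems.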

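Proposed Method

  • Apply the Im2Sim methodology to real-world images that represent systems (e.g., ocean waves, clouds, vegetation, cities).
  • Have the VLM describe the system and write code that simulates and generates it.
  • Execute the generated code to produce a synthetic image for comparison against the original image.
  • Analyze the outputs to assess the VLMs' ability to work across multiple layers of abstraction and their domain coverage.
  • Compare high-level modeling performance with fidelity to low-level, fine-grained image patterns.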
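The describe, code, execute loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `im2sim` function, the `vlm` callable interface, the prompt text, the `generate()` convention, and the `fake_vlm` stand-in are all hypothetical names introduced here, and images are represented as plain nested lists to keep the example dependency-free.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel values in [0, 1]


def im2sim(
    image: Image,
    vlm: Callable[[Image, str], Tuple[str, str]],
    prompt: str = (
        "Describe the system in this image and write Python code "
        "that defines generate() returning a simulated image."
    ),
) -> Dict[str, object]:
    """Run one describe -> code -> execute round (hypothetical interface)."""
    # Step 1: the model returns a text description and a generative program.
    description, code = vlm(image, prompt)

    # Step 2: execute the generated program in its own namespace.
    # WARNING: exec runs arbitrary code; sandbox model output in real use.
    namespace: Dict[str, object] = {}
    exec(code, namespace)

    # Step 3: call the (assumed) generate() entry point to get the synthetic image.
    synthetic = namespace["generate"]()
    return {"description": description, "code": code, "synthetic": synthetic}


def fake_vlm(image: Image, prompt: str) -> Tuple[str, str]:
    """Stand-in for a real VLM call; returns a canned description and program."""
    description = "Parallel wave crests moving across a water surface."
    code = (
        "import math\n"
        "def generate(width=8, height=4):\n"
        "    return [[0.5 + 0.5 * math.sin(x) for x in range(width)]\n"
        "            for _ in range(height)]\n"
    )
    return description, code


original: Image = [[0.0] * 8 for _ in range(4)]  # placeholder input image
result = im2sim(original, fake_vlm)
print(result["description"])
print(len(result["synthetic"]), "x", len(result["synthetic"][0]))
```

In practice `fake_vlm` would be replaced by a call to an actual model, and the synthetic image would then be compared against the original; how that comparison is done is left to the caller here.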

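Experimental Results

Research Questions

  • RQ1: Can vision language models accurately describe the complex systems depicted in images and generate executable simulation code?
  • RQ2: To what extent do the generated simulations reproduce the broad structure and emergent properties of the original images across domains (physics, vegetation, cities, geology)?
  • RQ3: Do VLMs exhibit an asymmetry between high-level understanding and fine-grained replication?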

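Key Findings

  • Leading VLMs (e.g., GPT, Gemini) can understand and model complex, multi-component systems across multiple layers of abstraction.
  • This ability holds across a wide range of domains, from physical systems to vegetation and cities.
  • VLMs show limited ability to replicate fine details and low-level arrangements in the image.
  • The study identifies an asymmetry: strong high-level understanding paired with weaker perception of low-level details.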
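One way to picture this asymmetry is to compare two images at two scales. The sketch below is purely illustrative and is not the evaluation used in the paper: the `downsample` and `mean_abs_diff` helpers and the toy checkerboard images are assumptions introduced here. It shows how a synthetic image can match the original's coarse structure while differing in every fine-grained pixel.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values in [0, 1]


def downsample(img: Image, factor: int) -> Image:
    """Average non-overlapping factor x factor blocks (sizes must divide evenly)."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            / (factor * factor)
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]


def mean_abs_diff(a: Image, b: Image) -> float:
    """Mean absolute per-pixel difference between two same-sized images."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


# Two 4x4 checkerboards with opposite phase: the same kind of pattern and the
# same coarse brightness, but a different pixel-level arrangement.
original = [[float((x + y) % 2) for x in range(4)] for y in range(4)]
synthetic = [[float((x + y + 1) % 2) for x in range(4)] for y in range(4)]

fine = mean_abs_diff(original, synthetic)
coarse = mean_abs_diff(downsample(original, 2), downsample(synthetic, 2))
print(f"fine-scale difference:   {fine}")
print(f"coarse-scale difference: {coarse}")
```

Here the fine-scale difference is maximal while the coarse-scale difference vanishes, which is a toy analogue of a simulation that captures the system but not the exact arrangement of its details.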


This review was generated by AI and reviewed by human editors.