QUICK REVIEW

[Paper Review] TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning

Daixian Liu, Jiayi Kuang | arXiv (Cornell University) | Jan 23, 2026

Multimodal Machine Learning Applications | Cited by 0

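One-Sentence Summary

TangramPuzzle introduces a geometry-grounded benchmark for evaluating compositional spatial reasoning in multimodal LLMs. It uses Tangram Construction Expressions (TCE) to make evaluation strict and machine-verifiable on two tasks: Outline Prediction and End-to-End Tangram Solution Generation.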

ABSTRACT

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual recognition and semantic understanding. Nevertheless, their ability to perform precise compositional spatial reasoning remains largely unexplored. Existing benchmarks often involve relatively simple tasks and rely on semantic approximations or coarse relative positioning, while their evaluation metrics are typically limited and lack rigorous mathematical formulations. To bridge this gap, we introduce TangramPuzzle, a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications, to mitigate the ambiguity of visual approximation. We design two complementary tasks: Outline Prediction, which demands inferring global shapes from local components, and End-to-End Code Generation, which requires solving inverse geometric assembly problems. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints, leading to distortions or deformations of the pieces.

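Research Motivation and Goals

Assess how well MLLMs perform precise spatial reasoning that goes beyond coarse semantic relations.
Provide a formal geometric representation (TCE) that anchors tangram configurations to exact coordinates.
Evaluate MLLMs on two tasks: discriminative outline inference (Outline Prediction) and constructive inverse assembly (End-to-End Tangram Solution Generation).
Measure how well models respect rigidity, non-overlap, and topological constraints under strict geometry, and where they fall short.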

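Proposed Method

Introduce the Tangram Construction Expression (TCE), a symbolic, LaTeX-based geometric schema that encodes piece types, vertex coordinates, edges, transformations, and the target outline.
Define two tasks: Outline Prediction (given exact TCE input, choose the correct outline from a set of options) and End-to-End Tangram Solution Generation (output a complete TCE JSON that exactly fills the target outline).
Apply a constraint-based validator that checks syntax, rigidity, non-overlap, and connectivity, then measure outline fidelity with IoU and Hausdorff distance.
Build the dataset with a multi-stage generation pipeline: raw tangram patterns from KiloGram, annotation-based capture, symbolic normalization into exact expressions, and human verification.
Evaluate a broad set of open-source and commercial MLLMs with standardized prompts and API calls; analyze failure modes in geometric-constraint satisfaction versus visual fidelity.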
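
The two outline-fidelity metrics named above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`point_in_polygon`, `grid_iou`, `hausdorff`) are hypothetical, IoU is approximated by sampling a regular grid rather than by exact polygon clipping, and the Hausdorff distance is computed over vertex sets only, which is a coarse stand-in for a distance between full outlines.

```python
import math

def point_in_polygon(x, y, poly):
    """Ray-casting test; poly is a sequence of (x, y) vertices in order."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does the horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def grid_iou(poly_a, poly_b, steps=200):
    """Approximate IoU by sampling cell centers over the joint bounding box."""
    pts = list(poly_a) + list(poly_b)
    x0, x1 = min(p[0] for p in pts), max(p[0] for p in pts)
    y0, y1 = min(p[1] for p in pts), max(p[1] for p in pts)
    inter = union = 0
    for i in range(steps):
        for j in range(steps):
            x = x0 + (i + 0.5) * (x1 - x0) / steps
            y = y0 + (j + 0.5) * (y1 - y0) / steps
            a = point_in_polygon(x, y, poly_a)
            b = point_in_polygon(x, y, poly_b)
            inter += a and b
            union += a or b
    return inter / union if union else 0.0

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two finite point sets."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))

# Illustrative data (not from the paper): a 2x2 square and a copy shifted by 1.
target = [(0, 0), (2, 0), (2, 2), (0, 2)]
predicted = [(1, 0), (3, 0), (3, 2), (1, 2)]
print(grid_iou(target, predicted))   # close to 1/3 (exact areas: 2 over 6)
print(hausdorff(target, predicted))
```

Grid sampling keeps the sketch dependency-free; an exact evaluation would normally use a polygon-clipping library instead.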

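Experimental Results

Research Questions

RQ1: Under strict geometric constraints, can MLLMs infer the global shape from local tangram components?
RQ2: Can MLLMs generate tangram assemblies that satisfy the geometric constraints and exactly fill a given target outline?
RQ3: Do models favor matching the silhouette at the expense of strict geometric constraints?
RQ4: How do in-context examples and reliance on textual geometry affect performance on the task objectives?
RQ5: How does vision-dominant grounding differ from text-driven grounding of geometric data?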

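Key Findings

MLLMs vary widely across tasks in outline accuracy and in adherence to geometric constraints.
High outline fidelity does not guarantee constraint satisfaction; many models distort pieces or produce overlaps to improve the visual match.
Gemini3-Pro performs strongly and robustly on geometric reasoning, with high constraint satisfaction and outline fidelity.
Top models may reach high IoU or look visually plausible yet fail to produce geometrically valid solutions (a 0% success rate in some cases).
In-context learning can improve the shape quality of answers that parse, but it raises the syntax error rate, suggesting a trade-off between symbolic precision and geometric understanding.
Textual geometry helps most models ground the problem; removing the text coordinates lowers performance, although Gemini3-Pro remains strong.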
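
The gap between silhouette fidelity and constraint satisfaction can be made concrete with a small rigidity check. The sketch below is an assumption-laden illustration, not the benchmark's validator: `looks_rigid` and `shoelace_area` are hypothetical helpers, and comparing sorted pairwise vertex distances is only a necessary condition for congruence in general (it is sufficient for triangles, by side-side-side).

```python
import math
from itertools import combinations

def pairwise_distances(vertices):
    """Sorted distances between every pair of vertices."""
    return sorted(math.dist(p, q) for p, q in combinations(vertices, 2))

def looks_rigid(canonical, placed, tol=1e-6):
    """True if `placed` could be a rotated/translated/flipped copy of `canonical`.

    Compares sorted pairwise distances: necessary for congruence in general,
    sufficient for triangles.
    """
    if len(canonical) != len(placed):
        return False
    ref = pairwise_distances(canonical)
    out = pairwise_distances(placed)
    return all(abs(a - b) <= tol for a, b in zip(ref, out))

def shoelace_area(vertices):
    """Polygon area from ordered vertices (shoelace formula)."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Illustrative pieces (not from the paper).
small_triangle = [(0, 0), (1, 0), (0, 1)]   # canonical right isosceles piece
rotated = [(2, 2), (2, 3), (1, 2)]          # rotated 90 degrees, then translated
stretched = [(0, 0), (2, 0), (0, 0.5)]      # same area, different shape

print(looks_rigid(small_triangle, rotated))
print(looks_rigid(small_triangle, stretched))
print(shoelace_area(small_triangle), shoelace_area(stretched))
```

The stretched triangle covers exactly as much area as the canonical piece, so it can contribute just as well to filling a silhouette, yet it fails the rigidity check. That is the kind of output an area-overlap metric alone would not penalize.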


This review was generated by AI and checked by human editors.