QUICK REVIEW

[Paper Review] TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space?

Yikun Zong, Cheston Tan|arXiv (Cornell University)|Feb 5, 2026

Spatial Cognition and Navigation | Cited by 0

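One-Sentence Summary

This paper shows that current vision-language models struggle with continuous geometric reasoning on Tangram tasks, and introduces a test-time self-refinement framework that uses in-context learning and reward-guided feedback to substantially raise IoU without retraining.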

ABSTRACT

Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. Inspired by how humans solve Tangram puzzles through trial-and-error, observation, and correction, we design a framework that models these human cognitive mechanisms. However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning: average IoU of only 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition, far below human performance where children can complete Tangram tasks successfully. This paper addresses a fundamental challenge in self-improving AI: can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops, inspired by human cognitive processes. Our training-free verifier-refiner agent applies recursive refinement loops that iteratively self-refine predictions based on geometric consistency feedback, achieving IoU improvements from 0.63 to 0.932 on medium-triangle cases without any model retraining. This demonstrates that incorporating human-inspired iterative refinement mechanisms through ICL and reward loops can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains. Our work is available at this anonymous link https://anonymous.4open.science/r/TangramVLM-F582/.

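Research Motivation and Goals

Use Tangram puzzles to evaluate and elicit VLMs' continuous geometric reasoning, exposing gaps in spatial precision.
Quantify the performance gap of leading VLMs on single-piece and two-piece Tangram tasks.
Propose a training-free, test-time refinement framework that combines in-context learning with reward-guided feedback to improve geometric performance.
Show that iterative refinement can substantially raise IoU in continuous spatial domains.
Provide a dataset and benchmark for evaluating multimodal models on continuous spatial geometry.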

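Proposed Method

Annotate Tangram samples with position, angle, and size using canonical templates, and compute pixel-level IoU.
Evaluate several VLMs (Qwen-3B/72B, GPT-4o mini, LLaMA Maverick, Gemini-2.5-pro, Claude) in zero-shot and few-shot ICL settings.
Define and compute geometry-based metrics on a 512×512 canvas: L2 position error, angle deviation, size error, and IoU.
Design four tasks (position only, angle only, size only, and two-piece composition) that progressively test geometric precision and compositional reasoning.
Introduce a test-time self-refinement loop (ICL plus reward-guided feedback) that optimizes a scalar reward combining IoU and position error without updating model parameters.
Where needed, apply a small local grid search for deterministic refinement.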
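The geometry-based metrics listed above can be illustrated with a minimal sketch. It is not the authors' code. It assumes a single right isosceles triangle piece whose `size` is its leg length, rotated about its centroid, and it counts a pixel as covered when its center lies inside the triangle. The `Pose` record and every function name are hypothetical, introduced only for illustration.

```python
import math
from dataclasses import dataclass

CANVAS = 512  # canvas side in pixels, as stated in the summary


@dataclass(frozen=True)
class Pose:
    """Hypothetical pose record: centroid (x, y) in pixels, angle in degrees, leg length in pixels."""
    x: float
    y: float
    angle: float
    size: float


def triangle_vertices(p: Pose):
    """Vertices of a right isosceles triangle with legs `size`, rotated about its centroid."""
    s = p.size
    # Unrotated vertices (0,0), (s,0), (0,s) shifted so the centroid sits at the origin.
    local = [(-s / 3, -s / 3), (2 * s / 3, -s / 3), (-s / 3, 2 * s / 3)]
    c, sn = math.cos(math.radians(p.angle)), math.sin(math.radians(p.angle))
    return [(p.x + c * u - sn * v, p.y + sn * u + c * v) for u, v in local]


def _inside(pt, tri):
    """Point-in-triangle test: the three edge cross products must share a sign."""
    px, py = pt
    signs = []
    for i in range(3):
        (x1, y1), (x2, y2) = tri[i], tri[(i + 1) % 3]
        signs.append((x2 - x1) * (py - y1) - (y2 - y1) * (px - x1))
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)


def rasterize(p: Pose, canvas: int = CANVAS):
    """Set of pixel indices whose centers fall inside the triangle, clipped to the canvas."""
    tri = triangle_vertices(p)
    xs, ys = [v[0] for v in tri], [v[1] for v in tri]
    x0, x1 = max(0, math.floor(min(xs))), min(canvas - 1, math.ceil(max(xs)))
    y0, y1 = max(0, math.floor(min(ys))), min(canvas - 1, math.ceil(max(ys)))
    return {(i, j)
            for i in range(x0, x1 + 1)
            for j in range(y0, y1 + 1)
            if _inside((i + 0.5, j + 0.5), tri)}


def pixel_iou(pred: Pose, target: Pose, canvas: int = CANVAS) -> float:
    """Pixel-level intersection over union of the two rasterized shapes."""
    a, b = rasterize(pred, canvas), rasterize(target, canvas)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def angle_deviation(a: float, b: float, period: float = 360.0) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % period
    return min(d, period - d)


def metrics(pred: Pose, target: Pose) -> dict:
    """The four quantities named in the summary, for one predicted piece."""
    return {
        "iou": pixel_iou(pred, target),
        "position_l2": math.hypot(pred.x - target.x, pred.y - target.y),
        "angle_dev": angle_deviation(pred.angle, target.angle),
        "size_err": abs(pred.size - target.size),
    }
```

The 360-degree period suits this triangle because it has no rotational symmetry. A square piece would need a 90-degree period, which is one reason IoU and raw angle error can disagree across piece types.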

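Experimental Results

Research Questions

RQ1: How well do current VLMs perform on continuous geometric reasoning tasks such as Tangram assembly?
RQ2: How much does performance degrade when moving from single-piece to two-piece composition?
RQ3: Without retraining, can test-time self-refinement with in-context learning and reward-guided feedback narrow the gap to human-level geometric precision?
RQ4: Which factors most affect the effectiveness and stability of test-time refinement (ICL size, number of refinement loop iterations, threshold)?
RQ5: Does the refinement method generalize to continuous spatial reasoning tasks beyond Tangram?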

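Key Findings

Across five VLMs, average IoU is about 0.41 on single-piece tasks and about 0.23 on two-piece tasks, far below human performance.
Single-piece IoU is highly sensitive to angle precision, and angle errors are common across all models.
Two-piece arrangements perform markedly worse because errors accumulate and pieces can collide with or miss each other.
Without retraining, test-time self-refinement with ICL and a reward-guided loop raises IoU on medium-triangle cases from 0.63 to 0.932.
The refinement loop typically converges within 1–2 iterations, and six iterations are usually enough for near-optimal improvement.
The best configuration found is k=15 in-context examples, Loop=6, threshold tau=0.9, and temperature 0, which yields robust improvements; larger ICL windows or higher temperatures may introduce noise.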
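A training-free refinement loop with the reported configuration might be organized as in the sketch below. It is not the paper's implementation. The summary does not give the reward formula or say what `tau` is compared against, so three things here are assumptions: `scalar_reward` (IoU minus a weighted, canvas-normalized position error), the use of `tau` as an early-stopping threshold on that reward, and the grid-search fallback once the loop budget runs out. `propose` is a hypothetical stand-in for the VLM call. In a real system it would be prompted with `icl_k` examples and decoded at `temperature`.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pose = Tuple[float, ...]  # hypothetical encoding, e.g. (x, y, angle)


@dataclass(frozen=True)
class RefineConfig:
    icl_k: int = 15           # in-context examples (consumed by the model call, not by this loop)
    max_loops: int = 6        # refinement iterations
    tau: float = 0.9          # assumed meaning: stop once the reward reaches this value
    temperature: float = 0.0  # decoding temperature for the model call


def scalar_reward(iou: float, pos_err: float,
                  canvas: float = 512.0, weight: float = 0.5) -> float:
    """Assumed form: IoU minus a weighted position error normalized by canvas size."""
    return iou - weight * (pos_err / canvas)


def local_grid_search(pose: Pose, reward: Callable[[Pose], float],
                      step: float = 1.0) -> Tuple[Pose, float]:
    """Deterministically try offsets of -step, 0, +step on every coordinate."""
    best, best_r = pose, reward(pose)
    for offsets in itertools.product((-step, 0.0, step), repeat=len(pose)):
        cand = tuple(v + d for v, d in zip(pose, offsets))
        r = reward(cand)
        if r > best_r:
            best, best_r = cand, r
    return best, best_r


def refine(initial: Pose,
           propose: Callable[[Pose, str], Pose],
           reward: Callable[[Pose], float],
           cfg: RefineConfig = RefineConfig()):
    """Verifier-refiner loop: score, feed the score back, keep the best candidate."""
    best, best_r = initial, reward(initial)
    history: List[Tuple[Pose, float]] = [(best, best_r)]
    for _ in range(cfg.max_loops):
        if best_r >= cfg.tau:
            break  # good enough, stop early
        feedback = f"previous pose {best} scored reward {best_r:.3f}"
        cand = propose(best, feedback)
        cand_r = reward(cand)
        history.append((cand, cand_r))
        if cand_r > best_r:
            best, best_r = cand, cand_r
    if best_r < cfg.tau:
        # Assumed fallback: deterministic local search around the best pose.
        best, best_r = local_grid_search(best, reward)
    return best, best_r, history
```

Keeping only candidates that improve the reward means a noisy proposal can never make the result worse. That is one plausible reason a loop of this kind stabilizes after a few iterations.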


This summary was generated by AI and checked by human editors.