QUICK REVIEW

[Paper Review] TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space?

Yikun Zong, Cheston Tan|arXiv (Cornell University)|Feb 5, 2026
Spatial Cognition and Navigation | Citations: 0
One-line summary

The paper shows current vision-language models struggle with continuous geometric reasoning in Tangram tasks, and introduces a test-time self-refinement framework using in-context learning and reward-guided feedback to substantially improve IoU without retraining.

ABSTRACT

Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. Inspired by how humans solve Tangram puzzles through trial-and-error, observation, and correction, we design a framework that models these human cognitive mechanisms. However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning: average IoU of only 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition, far below human performance where children can complete Tangram tasks successfully. This paper addresses a fundamental challenge in self-improving AI: can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops, inspired by human cognitive processes. Our training-free verifier-refiner agent applies recursive refinement loops that iteratively self-refine predictions based on geometric consistency feedback, achieving IoU improvements from 0.63 to 0.932 on medium-triangle cases without any model retraining. This demonstrates that incorporating human-inspired iterative refinement mechanisms through ICL and reward loops can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains. Our work is available at this anonymous link https://anonymous.4open.science/r/TangramVLM-F582/.

Motivation and objectives

  • Evaluate VLMs on continuous geometric reasoning using Tangram puzzles to reveal gaps in spatial accuracy.
  • Quantify performance gaps of leading VLMs on single-piece and two-piece Tangram tasks.
  • Propose a training-free test-time refinement framework combining in-context learning and reward-guided feedback to improve geometry.
  • Demonstrate that iterative refinement can substantially improve IoU in continuous spatial domains.
  • Provide a dataset and benchmarks for evaluating multimodal models on continuous-space geometry.

Proposed method

  • Annotate Tangram samples with position, angle, and size using canonical templates and compute pixel-level IoU.
  • Evaluate multiple VLMs (Qwen-3B/72B, GPT-4o mini, LLaMA Maverick, Gemini-2.5-pro, Claude) under zero-shot and few-shot ICL.
  • Define and compute geometry-based metrics (L2 position error, angular deviation, size error, IoU) on a 512×512 canvas.
  • Design four tasks (pos-only, angle-only, size-only, two-piece) to progressively test geometric precision and compositional reasoning.
  • Introduce a test-time self-refinement loop (ICL + reward-guided feedback) that optimizes a scalar reward combining IoU and position error without updating model parameters.
  • Use a small local grid search for deterministic refinement when needed.
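
The metrics and the local grid search listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the right-isosceles template in `triangle_vertices`, the pose tuple `(cx, cy, angle_deg, size)`, and the step sizes `step_px` and `step_deg` are assumptions made for illustration. Only the 512×512 canvas and the metric names come from the review.

```python
import math
from itertools import product

CANVAS = 512  # canvas side in pixels, as stated in the review


def triangle_vertices(cx, cy, angle_deg, size):
    """Hypothetical template: right isosceles triangle with legs `size`,
    rotated by `angle_deg` about its centroid at (cx, cy)."""
    base = [(-size / 3, -size / 3), (2 * size / 3, -size / 3), (-size / 3, 2 * size / 3)]
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in base]


def rasterize(verts):
    """Set of canvas pixels whose centres fall inside the triangle."""
    xs, ys = [v[0] for v in verts], [v[1] for v in verts]
    x0, x1 = max(0, math.floor(min(xs))), min(CANVAS - 1, math.ceil(max(xs)))
    y0, y1 = max(0, math.floor(min(ys))), min(CANVAS - 1, math.ceil(max(ys)))

    def side(a, b, px, py):  # sign tells which side of edge a->b the point is on
        return (b[0] - a[0]) * (py - a[1]) - (b[1] - a[1]) * (px - a[0])

    pixels = set()
    for py in range(y0, y1 + 1):
        for px in range(x0, x1 + 1):
            d = [side(verts[i], verts[(i + 1) % 3], px + 0.5, py + 0.5) for i in range(3)]
            if all(v >= 0 for v in d) or all(v <= 0 for v in d):
                pixels.add((px, py))
    return pixels


def iou(mask_a, mask_b):
    """Pixel-level intersection over union of two pixel sets."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def position_error(p, q):
    """L2 distance between two centre points, in pixels."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def angular_deviation(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def local_grid_search(pred, target_mask, step_px=4, step_deg=5):
    """Try `pred` and its 3x3x3 neighbours in (x, y, angle); keep the best IoU.
    Step sizes are illustrative assumptions, not values from the paper."""
    cx, cy, ang, size = pred
    best, best_iou = pred, iou(rasterize(triangle_vertices(*pred)), target_mask)
    offsets = product((-step_px, 0, step_px), (-step_px, 0, step_px), (-step_deg, 0, step_deg))
    for dx, dy, da in offsets:
        cand = (cx + dx, cy + dy, ang + da, size)
        score = iou(rasterize(triangle_vertices(*cand)), target_mask)
        if score > best_iou:
            best, best_iou = cand, score
    return best, best_iou
```

For example, with a target pose `(256, 256, 30, 80)` and a prediction `(260, 252, 35, 80)`, the target pose lies on the search grid, so `local_grid_search` can recover it. Storing masks as pixel sets restricted to each shape's bounding box keeps the sketch dependency-free; an array-based mask would be the more usual choice at scale.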

Experimental results

Research questions

  • RQ1: How well do current VLMs perform on continuous geometric reasoning tasks like Tangram assembly?
  • RQ2: How does performance degrade when moving from single-piece to two-piece Tangram composition?
  • RQ3: Can test-time self-refinement with in-context learning and reward-guided feedback close the gap to human-level geometric precision without retraining?
  • RQ4: What are the key factors (ICL size, refinement loop iterations, and thresholds) that govern the effectiveness and stability of test-time refinement?
  • RQ5: Is the refinement approach generalizable to other continuous spatial reasoning tasks beyond Tangram?

Key findings

  • Across five VLMs, single-piece IoU averages around 0.41 and two-piece IoU around 0.23, far below human performance.
  • IoU on single-piece tasks is highly sensitive to angular accuracy, with angle errors persisting across models.
  • Two-piece arrangement significantly degrades performance due to error accumulation and potential collisions or near-misses.
  • Test-time self-refinement with ICL and reward-guided loops improves medium-triangle IoU from 0.63 to 0.932 without retraining.
  • The refinement loop typically converges within 1–2 iterations, and a budget of six iterations is often sufficient for near-optimal improvement.
  • The best configuration reported uses ICL with k=15, Loop=6, threshold tau=0.9, and temperature=0; larger ICL windows or higher temperatures can introduce noise.
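
To make the reported configuration concrete, here is a minimal sketch of a verifier-refiner loop. Several parts are assumptions rather than details from the paper: the reward form `iou - lam * pos_err / canvas`, the reading of tau as an IoU stopping threshold, and the `propose` and `evaluate` callables. `toy_propose` and `toy_evaluate` are hypothetical stand-ins for a VLM call and a geometric verifier, and the toy proposer cheats by knowing the target so that the loop can be exercised without a model.

```python
# Values reported in the review; the key names are illustrative.
CONFIG = {"icl_examples": 15, "max_loops": 6, "tau": 0.9, "temperature": 0.0}


def reward(iou, pos_err, canvas=512.0, lam=0.5):
    """Assumed scalar reward: IoU minus a canvas-normalised position penalty."""
    return iou - lam * pos_err / canvas


def refine(propose, evaluate, max_loops=CONFIG["max_loops"], tau=CONFIG["tau"]):
    """Run up to `max_loops` propose/verify rounds with no parameter updates.

    propose(feedback) -> pose        (stand-in for a VLM call with ICL examples)
    evaluate(pose)    -> (iou, pos_err)   (stand-in for the geometric verifier)
    Stops early once IoU reaches `tau`; returns the best record and the history.
    """
    history, feedback, best = [], None, None
    for step in range(1, max_loops + 1):
        pose = propose(feedback)
        iou, pos_err = evaluate(pose)
        record = {"step": step, "pose": pose, "iou": iou,
                  "pos_err": pos_err, "reward": reward(iou, pos_err)}
        history.append(record)
        if best is None or record["reward"] > best["reward"]:
            best = record
        if iou >= tau:
            break
        feedback = record  # the next proposal is conditioned on this result
    return best, history


TOY_TARGET_X = 100.0


def toy_propose(feedback):
    """Hypothetical refiner: start at x=68, then halve the remaining x error."""
    if feedback is None:
        return (68.0, 256.0, 0.0)
    x, y, angle = feedback["pose"]
    return (x + (TOY_TARGET_X - x) / 2, y, angle)


def toy_evaluate(pose):
    """Hypothetical verifier: IoU falls off linearly with x error."""
    pos_err = abs(pose[0] - TOY_TARGET_X)
    return max(0.0, 1.0 - pos_err / 64.0), pos_err
```

Calling `refine(toy_propose, toy_evaluate)` runs the loop with the reported budget and threshold. Keeping the best record separately from the last one guards against a late iteration that scores worse than an earlier one.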

This review was generated by AI and checked by a human editor.