QUICK REVIEW

[Paper Review] Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning

Haokun Zhao, Wanshi Xu | arXiv (Cornell University) | Mar 19, 2026

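One-Sentence Summary

Introduces GeoAux-Bench, a benchmark that aligns textual auxiliary-construction steps with ground-truth visual updates for geometry problems, and proposes A2PO, a reinforcement learning framework with adaptive reward shaping that optimizes when and how to construct visual aids during reasoning.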

ABSTRACT

Geometric reasoning inherently requires "thinking with constructions" -- the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study reveals two critical insights: (1) interleaved visual-textual aids outperform single-modality counterparts, which cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, strongly correlating with reduced reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO employs Adaptive Reward Shaping to regulate the timing and quality of visual aids via counterfactual sampling to distinguish necessary from redundant constructions. Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. Code and data are available on GitHub.

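Motivation and Objectives

Treat geometric reasoning as a multimodal process that benefits from dynamic visual constructions, going beyond static diagrams.
Create a benchmark, GeoAux-Bench, that pairs textual auxiliary constructions with the corresponding visual updates.
Show that interleaved visual-textual reasoning outperforms single-modality approaches and reduces reasoning uncertainty.
Propose A2PO, a reinforcement learning framework that adaptively regulates the timing and quality of visual constructions to maximize their benefit.
Show that adaptive reward shaping and visual re-prompting deliver state-of-the-art improvements on GeoAux-Bench and external geometry benchmarks.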

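Proposed Method

Define GeoAux-Bench as 4,334 geometry problems and 8,470 diagrams, with explicit T_aux <-> I_aux alignment between textual construction steps and their visual updates.
Conduct a pilot study comparing text-only, visual-only, and interleaved settings to quantify modality complementarity and the effect on perplexity.
Introduce Action Applicability Policy Optimization (A2PO), built on GRPO with a tri-partition sampling scheme (O+, O-, O) that realizes counterfactual reasoning paths.
Use adaptive reward shaping with timing and quality rewards to encourage beneficial, low-entropy auxiliary constructions.
At inference time, apply visual re-prompting, injecting the auxiliary diagram once the construction has been verified as correct.
Propose a retrieval-based visual integration to simulate interleaved reasoning in current models.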
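
The GRPO-based training step listed above can be illustrated with a minimal sketch. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage. Everything else is an assumption made for illustration: the reading of O+ as rollouts that draw an auxiliary construction and O- as rollouts that do not, the helper names `construction_is_applicable` and `shaped_reward`, and the `bonus` value. None of these is taken from the paper, whose actual reward formulas are not given in this review.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def construction_is_applicable(acc_with, acc_without):
    """Hypothetical counterfactual test: did constructing beat not constructing?"""
    return mean(acc_with) > mean(acc_without)


def shaped_reward(correct, used_construction, applicable, bonus=0.2):
    """Hypothetical shaping rule (NOT the paper's formula).

    Base reward is answer correctness. A small bonus rewards constructing when
    the counterfactual test says it helps; a small penalty discourages
    constructing when it does not.
    """
    reward = 1.0 if correct else 0.0
    if used_construction:
        reward += bonus if applicable else -bonus
    return reward


# Toy group: 1.0 = correct answer, 0.0 = wrong answer.
o_plus = [1.0, 1.0, 0.0]   # assumed: rollouts that draw an auxiliary construction
o_minus = [0.0, 1.0, 0.0]  # assumed: rollouts that do not
applicable = construction_is_applicable(o_plus, o_minus)

rewards = [shaped_reward(c == 1.0, True, applicable) for c in o_plus]
rewards += [shaped_reward(c == 1.0, False, applicable) for c in o_minus]
advantages = group_relative_advantages(rewards)
```

In this toy group the constructing rollouts are more accurate, so a correct rollout that used a construction ends up with a larger advantage than a correct rollout that did not. A real implementation would also need the third partition (O) and a verifier for construction quality, both of which this sketch omits.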

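Experimental Results

Research Questions

RQ1: Are textual auxiliary instructions alone sufficient to capture the information content of the corresponding visual diagrams in geometric reasoning?
RQ2: Do interleaved visual-textual constructions outperform single-modality approaches in geometry problem solving?
RQ3: Can adaptive reward shaping effectively control when and how auxiliary visual aids are constructed to improve reasoning performance?
RQ4: Do visual salience and high-quality constructions correlate with lower reasoning perplexity and higher accuracy?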

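Key Findings

Interleaved visual-textual aids outperform single-modality aids, with gains of up to 1.97% in the pilot evaluation.
Valid auxiliary constructions act as entropy reducers, correlating with lower reasoning perplexity and more confident inference.
A2PO, with tri-partition sampling and adaptive reward shaping, achieves a gain of up to 3.51% over strong baselines on GeoAux-Bench.
On GeoAux-Bench and external geometry datasets, A2PO consistently outperforms the GRPO, ToRL, and GeometryZero baselines, most notably at the 7B model scale.
Increasing the visual salience of auxiliary diagrams lowers perplexity and raises accuracy, underscoring that perceptual clarity is a prerequisite for geometric reasoning.
Ablation studies show that visual re-prompting is critical, significantly improving performance beyond text-only or static visual guidance.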
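
Several findings above are stated in terms of reasoning perplexity. As a reminder of what that quantity measures, here is a minimal sketch using the standard definition: the exponential of the negative mean token log-probability. The function name `perplexity` and the toy inputs are illustrative assumptions; the review does not say how the paper extracts token log-probabilities from the model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)


# Toy reasoning traces (made-up probabilities, not data from the paper).
confident = [math.log(0.9)] * 5  # every token predicted with p = 0.9
uncertain = [math.log(0.5)] * 5  # every token predicted with p = 0.5

ppl_confident = perplexity(confident)
ppl_uncertain = perplexity(uncertain)
```

When every token has the same probability p, the perplexity is 1/p, so the uncertain trace scores higher than the confident one. Lower perplexity therefore corresponds to the "more confident inference" described in the findings.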

This review was generated by AI and reviewed by a human editor.