QUICK REVIEW

[Paper Review] SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

Niccolo Avogaro, Nayanika Debnath|arXiv (Cornell University)|Feb 6, 2026
Multimodal Machine Learning Applications|Cited by 0
One-Sentence Summary

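SPARC separates perception from reasoning in vision-language models: it first localizes the question-relevant image regions and then reasons over them, which enables test-time scaling and improves efficiency and accuracy without retraining the backbone.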

ABSTRACT

Despite recent successes, test-time scaling - i.e., dynamically expanding the token budget during inference as needed - remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing total visual tokens count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the $V^*$ VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task despite requiring a 200$\times$ lower token budget.
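
The compressed-context idea in the abstract (global search at low resolution, high resolution only for the selected region) can be illustrated with rough token arithmetic. This is a minimal sketch, not the paper's accounting: it assumes a hypothetical ViT-style encoder that emits one visual token per 28×28 pixel patch, and the image sizes, `PATCH`, and all function names are made up for illustration.

```python
import math

PATCH = 28  # ASSUMPTION: pixels per visual-token side; not a value from the paper


def visual_tokens(width: int, height: int, patch: int = PATCH) -> int:
    """Approximate token count: one token per patch, partial patches rounded up."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def downscale(width: int, height: int, max_side: int) -> tuple:
    """Shrink so the longer side is at most max_side, keeping the aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def two_stage_tokens(width, height, search_side, crop_w, crop_h) -> int:
    """Tokens for a low-resolution global search plus one high-resolution crop."""
    low_w, low_h = downscale(width, height, search_side)
    return visual_tokens(low_w, low_h) + visual_tokens(crop_w, crop_h)


# Hypothetical 4480x2800 image, searched at 448 px, then one 560x560 crop.
full = visual_tokens(4480, 2800)
staged = two_stage_tokens(4480, 2800, search_side=448, crop_w=560, crop_h=560)
print(full, staged)  # 16000 560
```

Under these assumed sizes the staged pass uses far fewer visual tokens than encoding the full image, which is the mechanism the abstract credits for the reduced budget; the actual savings depend on the model's tokenizer and the chosen resolutions.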

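Motivation and Objectives

  • Enable test-time scaling of VLMs by separating perceptual processing from reasoning
  • Demonstrate a two-stage pipeline that localizes question-relevant image regions before reasoning
  • Show that a modular perception stage can be trained and optimized independently to improve efficiency
  • Show that asymmetric compute allocation improves robustness under varying conditions
  • Provide evidence that cropping based on Implicit Relevance Detection (IRD) reduces the token budget while maintaining or improving accuracy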

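Proposed Method

  • Two-stage prompting: the first prompt outputs region coordinates (Implicit Relevance Detection, IRD); the second prompt uses the cropped regions to produce the final answer (perceptual reasoning)
  • Decouple perception from reasoning to enable independent optimization and context-efficient processing
  • Fuse multiple IRD rounds using self-consistency and crop aggregation (Weighted Boxes Fusion, WBF)
  • Share the visual KV cache across stages to reduce compute, and truncate the context to enable test-time scaling
  • Train a lightweight perception LoRA adapter on IRD using synthetic IRD annotations, improving localization without harming reasoning performance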
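
The two-stage flow in the list above can be sketched as plain control flow. This is an illustrative skeleton only: `locate` and `reason` are hypothetical callables standing in for the two VLM prompts, the image is a toy list of pixel rows, and the full-image fallback is an assumption rather than something the paper states.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2) in [0, 1]
Image = List[List[int]]                  # toy image: rows of pixel values


def crop(image: Image, box: Box) -> Image:
    """Cut the region described by a normalized box out of a row-major image."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    left, top = int(x1 * width), int(y1 * height)
    # Keep at least one pixel in each direction so the crop is never empty.
    right = max(left + 1, int(x2 * width))
    bottom = max(top + 1, int(y2 * height))
    return [row[left:right] for row in image[top:bottom]]


def sparc_answer(
    image: Image,
    question: str,
    locate: Callable[[Image, str], Sequence[Box]],   # hypothetical stage-1 model call
    reason: Callable[[List[Image], str], str],       # hypothetical stage-2 model call
) -> str:
    # Stage 1 (perception): ask only for question-relevant regions.
    boxes = locate(image, question)
    # ASSUMPTION: fall back to the full image when nothing is localized.
    crops = [crop(image, b) for b in boxes] or [image]
    # Stage 2 (reasoning): answer conditioned on the selected crops.
    return reason(crops, question)


# Stub "models" so the sketch runs without any VLM.
def fake_locate(image: Image, question: str) -> Sequence[Box]:
    return [(0.5, 0.5, 1.0, 1.0)]  # pretend the bottom-right quadrant is relevant


def fake_reason(crops: List[Image], question: str) -> str:
    return f"max pixel = {max(max(row) for c in crops for row in c)}"


toy = [[r * 4 + c for c in range(4)] for r in range(4)]
print(sparc_answer(toy, "What is the brightest pixel?", fake_locate, fake_reason))
```

Keeping the two stages behind separate callables mirrors the decoupling described above: either stage can be swapped, sampled more often, or fine-tuned without touching the other.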
Figure 1 : Overview of the SPARC framework. We decouple the VLM inference process into two distinct functional circuits. Stage 1 (Perception): The What and Where Circuits perform Implicit Relevance Detection (IRD), taking the image and question as input to output relevant crop coordinates (e.g., loc

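Experimental Results

Research Questions

  • RQ1: Can the two-stage SPARC pipeline improve VLM performance with fewer visual tokens than a single prompt?
  • RQ2: Does separating perception from reasoning allow asymmetric compute allocation toward perception without degrading reasoning quality?
  • RQ3: How does IRD-based cropping affect accuracy on in-domain and out-of-domain visual tasks?
  • RQ4: Can lightweight fine-tuning of the perception stage via LoRA improve IRD without harming reasoning ability?
  • RQ5: What role does crop fusion (WBF) play in stabilizing and improving downstream VQA accuracy?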

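Key Findings

  • SPARC yields significant gains in VQA-style accuracy over single-prompt and "thinking with images" approaches (for example, +6.7 percentage points for Qwen3VL-4B on V*)
  • Sharing the KV cache and using high-resolution crops enables efficient test-time scaling with a lower token budget
  • Self-consistent perception rounds combined with WBF improve accuracy while downstream compute drops by orders of magnitude
  • Fine-tuning perception via LoRA on low-resolution data still yields consistent gains, suggesting a regularization effect
  • In some OOD settings (e.g., XLRS remote sensing), SPARC still improves performance with a token budget up to 200× lower
  • On the V* and HRBench benchmarks, SPARC outperforms both the native and "thinking with images" baselines in ID and OOD settings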
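
To show what fusing several perception rounds might look like, here is a simplified sketch in the spirit of Weighted Boxes Fusion. It is not the paper's implementation nor the full WBF algorithm: it greedily assigns each box to the first cluster whose fused box overlaps it enough, then averages coordinates weighted by score. The uniform scores, the 0.5 IoU threshold, and the function names are assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def weighted_mean(members: Sequence[Tuple[Box, float]]) -> Box:
    """Score-weighted average of box coordinates within one cluster."""
    total = sum(score for _, score in members)
    return tuple(sum(box[k] * score for box, score in members) / total for k in range(4))


def fuse_boxes(boxes: Sequence[Box], scores: Sequence[float], iou_thr: float = 0.5) -> List[Box]:
    """Simplified WBF-style fusion: greedy clustering, then weighted averaging."""
    clusters: List[List[Tuple[Box, float]]] = []
    fused: List[Box] = []
    # Visit boxes from highest to lowest score (stable for ties).
    for box, score in sorted(zip(boxes, scores), key=lambda pair: -pair[1]):
        for i, candidate in enumerate(fused):
            if iou(candidate, box) >= iou_thr:
                clusters[i].append((box, score))
                fused[i] = weighted_mean(clusters[i])
                break
        else:
            # No existing cluster overlaps enough, so start a new one.
            clusters.append([(box, score)])
            fused.append(box)
    return fused


# Three hypothetical IRD rounds: two agree closely, one points elsewhere.
rounds = [(0.10, 0.10, 0.50, 0.50), (0.12, 0.10, 0.52, 0.50), (0.80, 0.80, 0.90, 0.90)]
merged = fuse_boxes(rounds, scores=[1.0, 1.0, 1.0])  # ASSUMPTION: uniform weights
print(merged)
```

Two agreeing rounds collapse into one averaged box while the outlier stays separate, so a caller could, for example, keep only clusters supported by several rounds as a self-consistency filter.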
Figure 2 : The plot shows downstream reasoning accuracy against the crop overlap ratio. While performance generally degrades as overlap decreases, this effect is most pronounced for lower resolutions. Crucially, at high overlap ratios, the 256px model converges to the performance of the full-resolut


This review was generated by AI and reviewed by a human editor.