QUICK REVIEW

[Paper Digest] M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering

Peijin Xie, Zhen Xu|arXiv (Cornell University)|Mar 9, 2026

Multimodal Machine Learning Applications | Cited by 0

One-Sentence Summary

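The paper treats visual evidence recognition as the main bottleneck in multimodal math reasoning and proposes M3-ACE, a multi-agent context engineering framework equipped with a Summary Tool and a Refine Tool. The framework collaboratively corrects perception without additional training and achieves state-of-the-art results on MathVision.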

ABSTRACT

Multimodal large language models have recently shown promising progress in visual mathematical reasoning. However, their performance is often limited by a critical yet underexplored bottleneck: inaccurate visual perception. Through systematic analysis, we find that most failures originate from incorrect or incomplete visual evidence extraction rather than deficiencies in reasoning capability. Moreover, models tend to remain overly confident in their initial perceptions, making standard strategies such as prompt engineering, multi-round self-reflection, or posterior guidance insufficient to reliably correct errors. To address this limitation, we propose M3-ACE, a multi-agentic context engineering framework designed to rectify visual perception in multimodal math reasoning. Instead of directly aggregating final answers, our approach decouples perception and reasoning by dynamically maintaining a shared context centered on visual evidence lists. Multiple agents collaboratively contribute complementary observations, enabling the system to expose inconsistencies and recover missing perceptual information. To support stable multi-turn collaboration, we further introduce two lightweight tools: a Summary Tool that organizes evidence from different agents into consistent, complementary, and conflicting components, and a Refine Tool that filters unreliable samples and guides iterative correction. Extensive experiments demonstrate that M3-ACE substantially improves visual mathematical reasoning performance across multiple benchmarks. Our method establishes a new state-of-the-art result of 89.1 on the MathVision benchmark and achieves consistent improvements on other related datasets, including MathVista and MathVerse. These results highlight the importance of perception-centric multi-agent collaboration for advancing multimodal reasoning systems.

Research Motivation and Objectives

Demonstrate that visual evidence extraction is the primary bottleneck in multimodal visual math reasoning.
Show that single-model self-correction is insufficient for correcting perception errors.
Propose a multi-agent context engineering framework (M3-ACE) to iteratively rectify visual perception.
Introduce lightweight tools (Summary Tool and Refine Tool) to stabilize multi-turn collaboration.
Evaluate the approach on MathVision and related benchmarks to establish state-of-the-art performance.

Proposed Method

Decouple visual perception from reasoning by maintaining a shared visual evidence list separate from the final answer.
Use multiple heterogeneous assistant agents to provide diverse visual evidence and expose potential inconsistencies.
Employ a Summary Tool to categorize visual evidence into consistent, complementary, and conflicting groups.
Use a Refine Tool to filter unreliable samples and guide iterative correction until convergence.
Iteratively regenerate and refine the anchor agent’s visual evidence and answer with a multi-round, cross-validated workflow.
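
The steps above can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: the digest does not specify how evidence is represented or how convergence is decided, so the dictionary-based evidence format, the function names `summarize_evidence` and `refine_loop`, the "stop when no conflicts remain" rule, and the stub agents are all assumptions made for this example. In the real system the agents would be multimodal models producing free-form evidence lists.

```python
from collections import defaultdict


def summarize_evidence(evidence_by_agent):
    """Toy stand-in for a Summary Tool.

    ASSUMPTION: each agent reports evidence as {attribute: value}.
    Returns three groups:
      consistent    - reported by every agent with the same value
      complementary - reported by only some agents, with no disagreement
      conflicting   - reported with at least two different values
    """
    reported = defaultdict(dict)  # attribute -> {agent_name: value}
    for agent_name, evidence in evidence_by_agent.items():
        for attribute, value in evidence.items():
            reported[attribute][agent_name] = value

    n_agents = len(evidence_by_agent)
    summary = {"consistent": {}, "complementary": {}, "conflicting": {}}
    for attribute, by_agent in reported.items():
        distinct_values = set(by_agent.values())
        if len(distinct_values) > 1:
            summary["conflicting"][attribute] = dict(by_agent)
        elif len(by_agent) == n_agents:
            summary["consistent"][attribute] = next(iter(distinct_values))
        else:
            summary["complementary"][attribute] = next(iter(distinct_values))
    return summary


def refine_loop(anchor, assistants, max_rounds=3):
    """Hypothetical multi-round loop: regenerate evidence until no conflicts.

    `anchor` and each assistant are callables taking the shared context
    (the previous summary) and returning an evidence dict.
    """
    context = {}
    summary = None
    for round_idx in range(1, max_rounds + 1):
        evidence = {"anchor": anchor(context)}
        for name, agent in assistants.items():
            evidence[name] = agent(context)
        summary = summarize_evidence(evidence)
        if not summary["conflicting"]:
            return summary, round_idx
        context = summary  # the shared context carries the conflicts forward
    return summary, max_rounds


# Stub agents (purely illustrative): the anchor misreads the radius at first
# and re-reads it once the shared context flags the conflict.
def stub_anchor(context):
    if "radius" in context.get("conflicting", {}):
        return {"shape": "circle", "radius": "3"}
    return {"shape": "circle", "radius": "5"}


def stub_assistant(context):
    return {"shape": "circle", "radius": "3", "label": "O"}


final_summary, rounds_used = refine_loop(stub_anchor, {"assistant": stub_assistant})
print(rounds_used, final_summary)
```

Keeping the shared context limited to evidence, rather than final answers, mirrors the decoupling described in the list: the anchor only produces its answer after the evidence has stabilized.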

Experimental Results

Research Questions

RQ1 Is visual evidence extraction the main source of errors in multimodal visual math reasoning, and does decoupling perception from reasoning improve results?
RQ2 Is single-model self-correction capable of fixing visual evidence errors, or is external multi-agent collaboration necessary?
RQ3 Does multi-agent context engineering with structured summarization and refinement outperform single-agent prompting and reflection on visual math tasks?
RQ4 How do auxiliary tools (Summary Tool, Refine Tool) influence stability and convergence of iterative perceptual refinement?

Key Findings

Visual evidence extraction is identified as the dominant bottleneck in current multimodal visual math reasoning models.
Single-model self-correction with prompting or reflection yields limited improvements and can destabilize correct predictions.
External supervision via multiple agents provides complementary information that improves perceptual accuracy and final answers.
The M3-ACE pipeline with decoupling, complementary information, and filtering significantly boosts performance on MathVision and other benchmarks.
Auxiliary tools enable stable, efficient refinement, focusing effort on hard or disputed samples and reducing computational load.
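
One way to picture the last finding is a filter that sends only disputed samples into another refinement round. The sketch below is an assumption-laden illustration: the digest does not state the Refine Tool's actual criterion, so the conflict-count threshold, the function name `split_by_dispute`, and the sample data are all hypothetical and are not results from the paper.

```python
def split_by_dispute(conflicts_per_sample, threshold=0):
    """Split sample ids into settled and disputed groups.

    ASSUMPTION: a sample is "disputed" when the number of conflicting
    evidence items across agents exceeds `threshold`.
    """
    settled, disputed = [], []
    for sample_id, n_conflicts in conflicts_per_sample.items():
        if n_conflicts > threshold:
            disputed.append(sample_id)
        else:
            settled.append(sample_id)
    return settled, disputed


# Made-up conflict counts for five samples (illustrative only).
conflict_counts = {"q1": 0, "q2": 2, "q3": 0, "q4": 1, "q5": 0}
settled_ids, disputed_ids = split_by_dispute(conflict_counts)

# Fraction of samples that skip the next round of agent calls.
skipped_fraction = len(settled_ids) / len(conflict_counts)
print(settled_ids, disputed_ids, skipped_fraction)
```

Under these assumptions, only the disputed samples incur further agent calls, which is the sense in which filtering reduces computational load.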


This digest was generated by AI and reviewed by a human editor.