QUICK REVIEW

[Paper Review] M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering

Peijin Xie, Zhen Xu|arXiv (Cornell University)|Mar 9, 2026

Multimodal Machine Learning Applications | Cited by: 0

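One-Line Summary

The paper identifies visual evidence extraction as the main bottleneck in multimodal math reasoning and introduces M3-ACE, a multi-agent context engineering framework. Using a Summary Tool and a Refine Tool, it collaboratively rectifies perception without additional training and achieves state-of-the-art results on MathVision.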

ABSTRACT

Multimodal large language models have recently shown promising progress in visual mathematical reasoning. However, their performance is often limited by a critical yet underexplored bottleneck: inaccurate visual perception. Through systematic analysis, we find that most failures originate from incorrect or incomplete visual evidence extraction rather than deficiencies in reasoning capability. Moreover, models tend to remain overly confident in their initial perceptions, making standard strategies such as prompt engineering, multi-round self-reflection, or posterior guidance insufficient to reliably correct errors. To address this limitation, we propose M3-ACE, a multi-agentic context engineering framework designed to rectify visual perception in multimodal math reasoning. Instead of directly aggregating final answers, our approach decouples perception and reasoning by dynamically maintaining a shared context centered on visual evidence lists. Multiple agents collaboratively contribute complementary observations, enabling the system to expose inconsistencies and recover missing perceptual information. To support stable multi-turn collaboration, we further introduce two lightweight tools: a Summary Tool that organizes evidence from different agents into consistent, complementary, and conflicting components, and a Refine Tool that filters unreliable samples and guides iterative correction. Extensive experiments demonstrate that M3-ACE substantially improves visual mathematical reasoning performance across multiple benchmarks. Our method establishes a new state-of-the-art result of 89.1 on the MathVision benchmark and achieves consistent improvements on other related datasets, including MathVista and MathVerse. These results highlight the importance of perception-centric multi-agent collaboration for advancing multimodal reasoning systems.

Research Motivation and Objectives

Demonstrate that visual evidence extraction is the primary bottleneck in multimodal visual math reasoning.
Show that single-model self-correction is insufficient for correcting perception errors.
Propose a multi-agent context engineering framework (M3-ACE) to iteratively rectify visual perception.
Introduce lightweight tools (Summary Tool and Refine Tool) to stabilize multi-turn collaboration.
Evaluate the approach on MathVision and related benchmarks to establish state-of-the-art performance.

Proposed Method

Decouple visual perception from reasoning by maintaining a shared visual evidence list separate from the final answer.
Use multiple heterogeneous assistant agents to provide diverse visual evidence and expose potential inconsistencies.
Employ a Summary Tool to categorize visual evidence into consistent, complementary, and conflicting groups.
Use a Refine Tool to filter unreliable samples and guide iterative correction until convergence.
Iteratively regenerate and refine the anchor agent’s visual evidence and answer with a multi-round, cross-validated workflow.
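
The evidence-grouping step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the review does not specify how evidence is represented or compared, so the sketch assumes each agent reports evidence as a dictionary of fact name to value, and the function name `summarize_evidence`, the agent labels, and the example facts are all hypothetical.

```python
from collections import defaultdict


def summarize_evidence(evidence_by_agent):
    """Group per-agent evidence into consistent / complementary / conflicting.

    Assumption (not from the paper): evidence is a dict of fact -> value,
    and two reports agree only when their values are exactly equal.
    """
    # Collect, for every fact, which agent reported which value.
    reports_by_fact = defaultdict(dict)
    for agent, evidence in evidence_by_agent.items():
        for fact, value in evidence.items():
            reports_by_fact[fact][agent] = value

    summary = {"consistent": {}, "complementary": {}, "conflicting": {}}
    for fact, reports in reports_by_fact.items():
        distinct_values = set(reports.values())
        if len(reports) == 1:
            # Only one agent saw this fact: it complements the others.
            summary["complementary"][fact] = next(iter(reports.values()))
        elif len(distinct_values) == 1:
            # Several agents reported the same value.
            summary["consistent"][fact] = next(iter(distinct_values))
        else:
            # Agents disagree: keep every report so the dispute is visible.
            summary["conflicting"][fact] = dict(reports)
    return summary


# Hypothetical evidence lists for one geometry problem.
evidence = {
    "anchor": {"angle_ABC": "90", "side_AB": "5"},
    "assistant_1": {"angle_ABC": "90", "side_AB": "6", "side_BC": "12"},
}
summary = summarize_evidence(evidence)
```

In this toy case the right angle is consistent, `side_BC` is complementary (only the assistant reported it), and `side_AB` is conflicting. A real system would presumably need fuzzier matching than exact string equality, since agents describe the same visual fact in different words.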

Experimental Results

Research Questions

RQ1: Can visual evidence extraction be the main source of errors in multimodal visual math reasoning, and can decoupling perception from reasoning improve results?
RQ2: Is single-model self-correction capable of fixing visual evidence errors, or is external multi-agent collaboration necessary?
RQ3: Does multi-agent context engineering with structured summarization and refinement outperform single-agent prompting and reflection on visual math tasks?
RQ4: How do auxiliary tools (Summary Tool, Refine Tool) influence stability and convergence of iterative perceptual refinement?

Key Findings

Visual evidence extraction is identified as the dominant bottleneck in current multimodal visual math reasoning models.
Single-model self-correction with prompting or reflection yields limited improvements and can destabilize correct predictions.
External supervision via multiple agents provides complementary information that improves perceptual accuracy and final answers.
The M3-ACE pipeline with decoupling, complementary information, and filtering significantly boosts performance on MathVision and other benchmarks.
Auxiliary tools enable stable, efficient refinement, focusing effort on hard or disputed samples and reducing computational load.
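
The last finding, that refinement effort is spent only on hard or disputed samples, can be sketched as a filtering loop. This is a hedged illustration rather than the paper's Refine Tool: the criterion in `needs_refinement`, the sample fields (`conflicts`, `answers`), the `max_rounds` cap, and the `demo_regenerate` stand-in for model calls are all assumptions made for the example.

```python
def needs_refinement(sample):
    # Assumption: a sample is "disputed" if evidence conflicts remain
    # or the agents' final answers still disagree.
    return sample["conflicts"] > 0 or len(set(sample["answers"])) > 1


def refine(samples, regenerate, max_rounds=3):
    """Re-run only disputed samples until they settle or rounds run out.

    `regenerate` is a caller-supplied function that updates a sample in place;
    in a real system it would call the agents again with the shared context.
    Returns the number of rounds used and the ids still unresolved.
    """
    rounds_used = 0
    pending = [s for s in samples if needs_refinement(s)]
    while pending and rounds_used < max_rounds:
        for sample in pending:
            regenerate(sample)
        rounds_used += 1
        # Settled samples drop out, so later rounds touch fewer items.
        pending = [s for s in pending if needs_refinement(s)]
    return rounds_used, [s["id"] for s in pending]


def demo_regenerate(sample):
    # Stand-in for model calls: resolves one conflict per round and,
    # once no conflicts remain, aligns all answers with the first one.
    if sample["conflicts"] > 0:
        sample["conflicts"] -= 1
    if sample["conflicts"] == 0:
        sample["answers"] = [sample["answers"][0]] * len(sample["answers"])


samples = [
    {"id": "q1", "conflicts": 0, "answers": ["12", "12"]},  # already settled
    {"id": "q2", "conflicts": 2, "answers": ["7", "9"]},    # disputed
]
rounds_used, unresolved = refine(samples, demo_regenerate)
```

Here `q1` is never regenerated because it is already settled, which is where the reduction in computational load would come from under these assumptions; `q2` settles after two rounds.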


This review was generated by AI and checked by a human editor.