QUICK REVIEW

[Paper Review] M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering

Peijin Xie, Zhen Xu|arXiv (Cornell University)|2026. 03. 09.

Multimodal Machine Learning Applications | Citations: 0

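One-Line Summary

This paper identifies visual evidence extraction as the main bottleneck in multimodal math reasoning and introduces M3-ACE, a multi-agent context engineering framework in which a Summary Tool and a Refine Tool cooperate to correct perception without additional training, achieving state-of-the-art results on MathVision.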

ABSTRACT

Multimodal large language models have recently shown promising progress in visual mathematical reasoning. However, their performance is often limited by a critical yet underexplored bottleneck: inaccurate visual perception. Through systematic analysis, we find that most failures originate from incorrect or incomplete visual evidence extraction rather than deficiencies in reasoning capability. Moreover, models tend to remain overly confident in their initial perceptions, making standard strategies such as prompt engineering, multi-round self-reflection, or posterior guidance insufficient to reliably correct errors. To address this limitation, we propose M3-ACE, a multi-agentic context engineering framework designed to rectify visual perception in multimodal math reasoning. Instead of directly aggregating final answers, our approach decouples perception and reasoning by dynamically maintaining a shared context centered on visual evidence lists. Multiple agents collaboratively contribute complementary observations, enabling the system to expose inconsistencies and recover missing perceptual information. To support stable multi-turn collaboration, we further introduce two lightweight tools: a Summary Tool that organizes evidence from different agents into consistent, complementary, and conflicting components, and a Refine Tool that filters unreliable samples and guides iterative correction. Extensive experiments demonstrate that M3-ACE substantially improves visual mathematical reasoning performance across multiple benchmarks. Our method establishes a new state-of-the-art result of 89.1 on the MathVision benchmark and achieves consistent improvements on other related datasets, including MathVista and MathVerse. These results highlight the importance of perception-centric multi-agent collaboration for advancing multimodal reasoning systems.

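Research Motivation and Goals

Demonstrate that visual evidence extraction is the main bottleneck in multimodal visual math reasoning.
Show that single-model self-correction is insufficient for fixing perception errors.
Propose a multi-agent context engineering framework (M3-ACE) that iteratively corrects visual perception.
Introduce lightweight tools (the Summary Tool and the Refine Tool) to stabilize multi-turn collaboration.
Evaluate the approach on MathVision and related benchmarks, establishing state-of-the-art performance.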

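Proposed Method

Maintain a shared visual evidence list, kept separate from final answers, to decouple visual perception from reasoning.
Use multiple heterogeneous assistant agents to contribute diverse visual evidence and expose potential inconsistencies.
Use the Summary Tool to sort visual evidence into consistent, complementary, and conflicting groups.
Use the Refine Tool to filter out unreliable samples and guide iterative correction until convergence.
Run a multi-round, cross-validated workflow that repeatedly regenerates and refines the anchor agent's visual evidence and answer.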
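
The workflow described in this section can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the authors' implementation: visual evidence is modeled as simple key-value facts (the review only says "evidence lists"), the agents are stub functions rather than multimodal models, and the names `summarize`, `needs_refinement`, and `run_rounds` are hypothetical. The Summary Tool and Refine Tool in the paper are likely far richer than the dictionary comparison used here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Summary:
    consistent: dict      # facts every agent reports with the same value
    complementary: dict   # facts only some agents report, with no disagreement
    conflicting: dict     # facts on which agents report different values


def summarize(evidence_by_agent):
    """Toy stand-in for the Summary Tool: group key-value evidence by agreement."""
    values = defaultdict(dict)  # fact key -> {agent name: reported value}
    for agent, evidence in evidence_by_agent.items():
        for key, value in evidence.items():
            values[key][agent] = value
    n_agents = len(evidence_by_agent)
    summary = Summary({}, {}, {})
    for key, by_agent in values.items():
        distinct = set(by_agent.values())
        if len(distinct) > 1:
            summary.conflicting[key] = dict(by_agent)
        elif len(by_agent) == n_agents:
            summary.consistent[key] = next(iter(distinct))
        else:
            summary.complementary[key] = next(iter(distinct))
    return summary


def needs_refinement(summary):
    """Toy stand-in for the Refine Tool filter (assumption): revisit only conflicts."""
    return bool(summary.conflicting)


def run_rounds(anchor, assistants, sample, max_rounds=3):
    """Collect evidence, summarize it, and let agents regenerate with shared context."""
    context = None
    summary = None
    for _ in range(max_rounds):
        evidence = {"anchor": anchor(sample, context)}
        for i, assistant in enumerate(assistants):
            evidence[f"assistant_{i}"] = assistant(sample, context)
        summary = summarize(evidence)
        if not needs_refinement(summary):
            break  # no conflicts left: treat the evidence as converged
        context = summary  # the shared context for the next round
    return summary


# Hypothetical stub agents for a single geometry sample.
def anchor(sample, context):
    # Misreads the radius as 3 at first; once shown a conflict on it, re-reads it as 5.
    if context and "radius" in context.conflicting:
        return {"radius": 5, "shape": "circle"}
    return {"radius": 3, "shape": "circle"}


def assistant_a(sample, context):
    return {"radius": 5, "shape": "circle"}


def assistant_b(sample, context):
    return {"radius": 5, "shape": "circle", "label": "O"}


result = run_rounds(anchor, [assistant_a, assistant_b], sample=None)
print(result)
```

In this toy run the first round flags `radius` as conflicting, the anchor regenerates its evidence with that conflict in context, and the second round ends with no conflicts. The point of the design is that agents exchange perceptual evidence rather than final answers, so a disagreement surfaces before any reasoning is built on a misread figure.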

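Experimental Results

Research Questions

RQ1: Is visual evidence extraction the main source of error in multimodal visual math reasoning, and can decoupling perception from reasoning improve results?
RQ2: Can single-model self-correction fix visual evidence errors, or is external multi-agent collaboration required?
RQ3: Does multi-agent context engineering with structured summarization and refinement outperform single-agent prompting and self-reflection on visual math tasks?
RQ4: How do the auxiliary tools (Summary Tool, Refine Tool) affect the stability and convergence of iterative perception correction?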

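Key Findings

Visual evidence extraction is confirmed as the dominant bottleneck in current multimodal visual math reasoning models.
Single-model self-correction through prompting or self-reflection yields limited gains and can destabilize predictions that were already correct.
External supervision from multiple agents supplies complementary information, improving both perception accuracy and final answers.
The M3-ACE pipeline, which combines decoupling, complementary information, and filtering, substantially improves performance on MathVision and other benchmarks.
The auxiliary tools enable stable and efficient refinement, focusing effort on hard or contested samples and reducing computational load.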


This review was generated by AI and reviewed by a human editor.