QUICK REVIEW

[Paper Review] CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Tongkun Guan, Zhibo Yang | arXiv (Cornell University) | 2026-03-11

Multimodal Machine Learning Applications | Citations: 0

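One-Line Summary

This paper identifies perception as the main bottleneck in STEM visual reasoning and presents CodePercept, a code-grounded framework that strengthens perception through executable code, together with the ICC-1M dataset and the STEM2Code-Eval benchmark.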

ABSTRACT

When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium--executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.

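Motivation and Objectives

Determine, through scaling analysis, whether perception is the limiting factor in STEM visual reasoning.
Propose a code-grounded perception paradigm to improve visual understanding in STEM domains.
Build a large-scale dataset and a benchmark for training and evaluating code-grounded perception.
Show that code can match or outperform caption-based approaches on perception tasks.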

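Proposed Method

Decompose STEM visual reasoning into a perception stage (image to caption) and a reasoning stage (caption to answer), and scale each component independently.
Generate image-caption-code triplets to ground perception in executable Python code. The code serves as a precise semantic medium.
Build ICC-1M, a dataset of more than 1M image-caption-code triplets, through three pipelines: image reproduction, image diversification, and template-based solid geometry synthesis.
Introduce STEM2Code-Eval, a 1,000-image benchmark that requires models to generate executable code reproducing each image, enabling deterministic perception evaluation.
Train models on two code-grounded tasks, Code-Grounded Caption Generation and STEM Image-to-Code Translation, using supervised fine-tuning and GRPO with explicit rewards.
Learn caption and code generation jointly in a two-stage training scheme (CodePercept-S1 and CodePercept-R1) with code-grounded supervision signals.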
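
The GRPO step mentioned in the list above can be illustrated with a minimal sketch. GRPO, as it is generally described, samples a group of responses for the same prompt and scores each one relative to the rest of its group, rather than against a learned value function. The code below shows only that group-relative normalization. The reward values are made up for illustration, and the use of the population standard deviation is an assumption; this review does not describe the paper's actual reward design or hyperparameters.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per sampled response to the same prompt.
    `eps` avoids division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four code generations sampled for one image,
# e.g. 1.0 = runs and matches, 0.5 = runs only, 0.0 = fails to run.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so the policy update favors the better samples within each group.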

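Experimental Results

Research Questions

RQ1: Can perception be identified as the bottleneck in STEM visual reasoning for MLLMs?
RQ2: Does grounding perception in executable code improve visual understanding more effectively than conventional caption-based distillation?
RQ3: Can large-scale image-caption-code data (ICC-1M) and the STEM2Code-Eval benchmark support reliable training and evaluation of code-grounded perception?
RQ4: Do code-grounded tasks yield measurable gains in perception and executable-code accuracy across STEM domains?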

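Key Findings

Scaling experiments show that scaling perception yields larger gains than scaling reasoning on STEM visual tasks.
CodePercept improves perception by using executable code as ground truth for image understanding and reconstruction.
STEM2Code-Eval provides deterministic, verifiable perception evaluation by requiring executable code that faithfully reproduces the image.
ICC-1M enables training on Code-Grounded Caption Generation and STEM Image-to-Code Translation, improving both perception and the accuracy of generated code.
In the experiments, CodePercept-based models outperformed baselines on perception-oriented benchmarks and related code generation tasks.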
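
To make the idea of evaluating perception through executable code more concrete, here is a minimal sketch of an executability gate: it runs a candidate script in a separate Python process and checks that it exits cleanly and writes an output file. This is not the STEM2Code-Eval protocol, which this review does not describe. The function name `runs_and_renders`, the `out.svg` file name, and the toy candidates are all hypothetical, and a real benchmark would also compare the rendered image against the reference.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_and_renders(code: str, output_name: str = "out.svg", timeout: float = 10.0) -> bool:
    """Return True if `code` exits with status 0 and writes a non-empty output file."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,           # relative output paths land in the temp dir
                capture_output=True,   # keep candidate output off our own stdout
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        output = Path(workdir) / output_name
        return result.returncode == 0 and output.is_file() and output.stat().st_size > 0


# Toy candidates standing in for model-generated reconstruction code.
good_candidate = """
with open("out.svg", "w") as f:
    f.write("<svg xmlns='http://www.w3.org/2000/svg'><circle cx='5' cy='5' r='4'/></svg>")
"""
bad_candidate = "raise RuntimeError('model emitted broken code')"

print(runs_and_renders(good_candidate))
print(runs_and_renders(bad_candidate))
```

Because the check depends only on the process exit status and the presence of the output file, the same candidate always gets the same verdict. Model-generated code is untrusted, so in practice it should run inside a sandbox rather than directly on the host.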

This review was generated by AI and reviewed by a human editor.