QUICK REVIEW

[Paper Review] CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Tongkun Guan, Zhibo Yang|arXiv (Cornell University)|Mar 11, 2026

Multimodal Machine Learning Applications | Cited by 0

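One-Sentence Summary

This paper identifies perception as the main bottleneck in STEM visual reasoning and proposes CodePercept, a code-grounded framework that, together with the ICC-1M dataset and the STEM2Code-Eval benchmark, uses executable code to improve visual perception.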

ABSTRACT

When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium--executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.

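Motivation and Objectives

Use scaling analysis to determine whether perception limits the STEM visual reasoning of MLLMs.
Propose a code-grounded perception paradigm to improve visual understanding in STEM domains.
Build a large-scale dataset and a benchmark for training and evaluating code-grounded perception.
Show that code can match or even surpass caption-based methods on perception tasks.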

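Proposed Method

Decouple STEM visual reasoning into two stages, perception (image to caption) and reasoning (caption to answer), and scale each component independently.
Ground perception in executable Python code by generating Image-Caption-Code triplets, using code as a precise semantic medium.
Build ICC-1M, a dataset of 1M+ Image-Caption-Code triplets, through three pipelines: image reproduction, image diversification, and template-based solid geometry synthesis.
Introduce STEM2Code-Eval, a 1,000-image benchmark in which models generate executable code to reconstruct each image, enabling deterministic perception evaluation.
Train models on two code-grounded tasks, Code-Grounded Caption Generation and STEM Image-to-Code Translation, using supervised fine-tuning and reinforcement learning with explicit rewards (GRPO).
Adopt a two-stage training recipe (CodePercept-S1 and CodePercept-R1) that jointly learns caption and code generation, with code as a verifiable supervision signal.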
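
The code-grounded training signal described above can be illustrated with a minimal Python sketch. It assumes a binary reward that only checks whether generated code executes; this summary does not specify the paper's actual reward design, so `ICCTriplet`, `runs_cleanly`, `keep_executable`, and `execution_reward` are hypothetical names introduced for illustration. The group-relative normalization follows the standard GRPO formulation, using the population standard deviation here (implementations differ on this detail).

```python
import statistics
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ICCTriplet:
    """One Image-Caption-Code record (field names are illustrative)."""
    image_path: str  # rendered STEM figure
    caption: str     # caption written with the code as ground truth
    code: str        # Python source intended to reproduce the figure


def runs_cleanly(code: str, timeout_s: float = 10.0) -> bool:
    """Return True if `code` exits with status 0 in a fresh interpreter."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def keep_executable(triplets: list[ICCTriplet]) -> list[ICCTriplet]:
    """Drop triplets whose code fails to run (an assumed filtering step)."""
    return [t for t in triplets if runs_cleanly(t.code)]


def execution_reward(code: str) -> float:
    """Hypothetical binary reward: 1.0 if the code executes, else 0.0."""
    return 1.0 if runs_cleanly(code) else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group, as GRPO does."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In use, one would sample several code candidates for the same image, score each with `execution_reward`, and pass the scores to `group_relative_advantages`, so that candidates that run are pushed up relative to those that crash. A realistic reward would likely also compare the rendered output with the target image, which this sketch leaves out.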

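Experimental Results

Research Questions

RQ1: Can perception be identified as the bottleneck in MLLMs' STEM visual reasoning?
RQ2: Does grounding perception in executable code improve visual understanding more effectively than conventional caption distillation?
RQ3: Can large-scale Image-Caption-Code data (ICC-1M) and the STEM2Code-Eval benchmark reliably train and evaluate code-grounded perception?
RQ4: Do code-grounded tasks bring substantial gains in perception and executable-code accuracy in STEM domains?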

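Key Findings

Scaling experiments show that improving perception yields larger gains on STEM visual tasks than scaling up reasoning.
CodePercept improves perception by treating executable code as the ground truth for image understanding and reconstruction.
STEM2Code-Eval provides deterministic, verifiable perception evaluation by requiring executable code that faithfully reproduces the image.
ICC-1M enables training on Code-Grounded Caption Generation and STEM Image-to-Code Translation, improving both perception and precise code generation.
Experiments show that CodePercept-based models outperform baselines on perception-oriented benchmarks and related code generation tasks.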
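
The decoupled scaling analysis behind the first finding can be sketched as a grid over perception and reasoning components. This is a toy illustration, not the paper's evaluation code: the `Perceiver` and `Reasoner` callables, the stub models, and the two-item dataset are all assumptions chosen only to make the example run, and the resulting numbers say nothing about the paper's measurements.

```python
from typing import Callable

Perceiver = Callable[[str], str]      # image id -> caption
Reasoner = Callable[[str, str], str]  # (caption, question) -> answer
Example = tuple[str, str, str]        # (image id, question, gold answer)


def accuracy(perceive: Perceiver, reason: Reasoner, dataset: list[Example]) -> float:
    """Run the two-stage pipeline and return exact-match accuracy."""
    correct = 0
    for image_id, question, gold in dataset:
        caption = perceive(image_id)          # stage 1: perception
        answer = reason(caption, question)    # stage 2: reasoning
        correct += answer == gold
    return correct / len(dataset)


def scaling_grid(
    perceivers: dict[str, Perceiver],
    reasoners: dict[str, Reasoner],
    dataset: list[Example],
) -> dict[tuple[str, str], float]:
    """Evaluate every (perceiver, reasoner) pairing independently."""
    return {
        (p_name, r_name): accuracy(p, r, dataset)
        for p_name, p in perceivers.items()
        for r_name, r in reasoners.items()
    }


# --- Toy stand-ins (hypothetical; real components would be MLLMs/LLMs) ---
def weak_perceiver(image_id: str) -> str:
    return "a closed shape"  # drops the detail the question needs


def strong_perceiver(image_id: str) -> str:
    return f"a {image_id}"   # keeps the shape name


def make_reasoner(known_sides: dict[str, str]) -> Reasoner:
    def reason(caption: str, question: str) -> str:
        for name, sides in known_sides.items():
            if name in caption:
                return sides
        return "unknown"
    return reason


dataset = [
    ("triangle", "How many sides?", "3"),
    ("square", "How many sides?", "4"),
]
grid = scaling_grid(
    {"weak_p": weak_perceiver, "strong_p": strong_perceiver},
    {"weak_r": make_reasoner({"triangle": "3"}),
     "strong_r": make_reasoner({"triangle": "3", "square": "4"})},
    dataset,
)
for pair, acc in grid.items():
    print(pair, acc)
```

Holding one axis fixed while varying the other is what lets the gain from each component be read off separately; comparing rows against columns of the grid is the comparison the finding refers to.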

This review was generated by AI and reviewed by human editors.