QUICK REVIEW

[Paper Review] CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Tongkun Guan, Zhibo Yang | arXiv (Cornell University) | Mar 11, 2026

Multimodal Machine Learning Applications | Citations: 0

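One-Sentence Summary

This paper identifies perception as the main bottleneck in STEM visual reasoning and presents CodePercept, a code-grounded framework accompanied by the ICC-1M dataset and the STEM2Code-Eval benchmark, which uses executable code to strengthen visual perception.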

ABSTRACT

When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium--executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.

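Motivation and Objectives

Determine, through scaling analyses, whether perception is what limits STEM visual reasoning.
Propose a code-grounded perception paradigm that improves visual understanding in STEM domains.
Build a large-scale dataset and a benchmark for training and evaluating code-grounded perception.
Demonstrate that code matches or outperforms caption-based approaches as a medium for perception in visual tasks.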

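Proposed Method

Decouple STEM visual reasoning into a perception stage (image-to-caption) and a reasoning stage (caption-to-answer), and scale each component independently.
Generate image-caption-code triplets to ground perception in executable Python code, using code as a precise semantic medium.
Construct ICC-1M, a dataset of more than 1M image-caption-code triplets, built with three pipelines: image reproduction, image diversity, and template-based robust geometry synthesis.
Propose STEM2Code-Eval, a 1,000-image benchmark that requires models to generate executable code reproducing each image, enabling deterministic evaluation of perception.
Train models on two code-grounded tasks, Code-Grounded Caption Generation and STEM Image-to-Code Translation, using supervised fine-tuning and reinforcement learning (GRPO with explicit rewards).
Adopt a two-stage training recipe (CodePercept-S1 and CodePercept-R1) that jointly learns caption and code generation, with code serving as a verifiable supervision signal.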
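To make the image-caption-code triplet idea concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: `Canvas`, `make_triplet`, and `DRAWING_CODE` are hypothetical names invented for illustration, and the "image" is only a small grid of 0/1 pixels. The point it illustrates is that when the caption is derived from the primitives the code actually draws, the caption cannot describe anything that is absent from the image.

```python
from dataclasses import dataclass, field


@dataclass
class Canvas:
    """Hypothetical toy raster canvas that logs every primitive drawn on it."""
    width: int = 16
    height: int = 16
    pixels: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def __post_init__(self):
        # Start from a blank grid of 0s (background).
        self.pixels = [[0] * self.width for _ in range(self.height)]

    def rect(self, x, y, w, h):
        # Fill an axis-aligned rectangle, clipped to the canvas.
        for row in range(max(y, 0), min(y + h, self.height)):
            for col in range(max(x, 0), min(x + w, self.width)):
                self.pixels[row][col] = 1
        self.log.append(f"a {w}x{h} rectangle with top-left corner at ({x}, {y})")

    def circle(self, cx, cy, r):
        # Fill every pixel whose center lies within distance r of (cx, cy).
        for row in range(self.height):
            for col in range(self.width):
                if (col - cx) ** 2 + (row - cy) ** 2 <= r * r:
                    self.pixels[row][col] = 1
        self.log.append(f"a circle of radius {r} centered at ({cx}, {cy})")


def make_triplet(code: str):
    """Execute drawing code and return (image, caption, code).

    The caption is assembled only from the logged primitives, so the
    executed code acts as ground truth for the description.
    """
    canvas = Canvas()
    exec(code, {"canvas": canvas})  # assumption: trusted toy code only
    caption = "The figure contains " + " and ".join(canvas.log) + "."
    return canvas.pixels, caption, code


# Hypothetical drawing program standing in for real plotting code.
DRAWING_CODE = "canvas.rect(2, 3, 4, 5)\ncanvas.circle(10, 10, 3)\n"

image, caption, code = make_triplet(DRAWING_CODE)
print(caption)
```

In a realistic setting the drawing code would be a plotting script and the caption would be written by a model conditioned on that script; the sketch only shows why code is a convenient anchor for both the image and its description.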

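Experimental Results

Research Questions

RQ1: Can perception be identified as the bottleneck for STEM visual reasoning in MLLMs?
RQ2: Does grounding perception in executable code improve visual understanding more effectively than conventional caption distillation?
RQ3: Can large-scale image-caption-code data (ICC-1M) and the STEM2Code-Eval benchmark reliably train and evaluate code-grounded perception?
RQ4: Do code-grounded tasks yield measurable improvements in perception and executable-code correctness across STEM domains?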

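Key Findings

Scaling experiments show that improving perception yields larger gains than scaling reasoning on STEM visual tasks.
CodePercept improves perception by using executable code as ground truth for image understanding and reconstruction.
STEM2Code-Eval provides deterministic evaluation by requiring executable code that faithfully reproduces the image, enabling verifiable assessment of perception.
ICC-1M enables training on Code-Grounded Caption Generation and STEM Image-to-Code Translation, improving perception and accurate code generation.
Experiments show that CodePercept-based models outperform baselines on perception-oriented benchmarks and related code generation tasks.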
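The idea of evaluating perception by executing generated code can be sketched as follows. This review does not describe the actual STEM2Code-Eval metrics, so everything here is an assumption made for illustration: the helper names `render`, `iou`, and `score`, the single `fill_rect` primitive, and the choice of pixel-level intersection-over-union as the similarity measure. The sketch shows only why such an evaluation is deterministic: the candidate code either runs or fails, and the rendered output is compared against the reference by a fixed rule.

```python
def render(code: str, size: int = 16):
    """Run drawing code on a blank grid; return the grid, or None if it fails."""
    grid = [[0] * size for _ in range(size)]

    def fill_rect(x, y, w, h):
        # Hypothetical drawing primitive: fill a rectangle clipped to the grid.
        for row in range(max(y, 0), min(y + h, size)):
            for col in range(max(x, 0), min(x + w, size)):
                grid[row][col] = 1

    try:
        # Emptying __builtins__ only limits the toy namespace;
        # it is NOT a real security sandbox.
        exec(code, {"__builtins__": {}, "fill_rect": fill_rect})
    except Exception:
        return None
    return grid


def iou(a, b):
    """Pixel-level intersection-over-union of two equally sized 0/1 grids."""
    inter = sum(p and q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    union = sum(p or q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return inter / union if union else 1.0


def score(reference_code: str, candidate_code: str):
    """Report whether the candidate executes and how well it matches."""
    ref = render(reference_code)
    if ref is None:
        raise ValueError("reference code must execute")
    cand = render(candidate_code)
    if cand is None:
        return {"executes": False, "iou": 0.0}
    return {"executes": True, "iou": iou(ref, cand)}


# Hypothetical reference and three candidate programs.
reference = "fill_rect(2, 2, 4, 4)"
exact = "fill_rect(2, 2, 4, 4)"
shifted = "fill_rect(4, 2, 4, 4)"
broken = "fill_rect(2, 2, 4"

for name, cand in [("exact", exact), ("shifted", shifted), ("broken", broken)]:
    print(name, score(reference, cand))
```

Running the same candidate twice always gives the same score, which is the property the benchmark relies on; a real benchmark would need a proper sandbox and a richer comparison than this pixel overlap.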


This review was generated by AI and checked by a human editor.