QUICK REVIEW

[Paper Review] STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models

Xuzhao Li, Xuchen Li|arXiv (Cornell University)|Jan 14, 2026

Topic Modeling | Citations: 0

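One-line summary

STEMVerse provides a dual-axis diagnostic framework for the STEM reasoning of LLMs, diagnosing gaps in knowledge and reasoning across disciplines and cognitive levels.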

ABSTRACT

As Large Language Models (LLMs) achieve significant breakthroughs in complex reasoning tasks, evaluating their proficiency in science, technology, engineering, and mathematics (STEM) has become a primary method for measuring machine intelligence. However, current evaluation paradigms often treat benchmarks as isolated "silos," offering only monolithic aggregate scores that neglect the intricacies of both academic specialization and cognitive depth. This result-oriented approach fails to distinguish whether model errors stem from insufficient domain knowledge or deficiencies in cognitive capacity, thereby limiting the diagnostic value. To address this, we propose STEMVerse, a diagnostic framework designed to systematically analyze the STEM reasoning capabilities of LLMs. This framework characterizes model performance across academic specialization and cognitive complexity to map the capability required for reasoning. We re-aggregate over 20,000 STEM problems from mainstream benchmarks into a unified "Discipline $\times$ Cognition" capability space, assigning dual-axis labels to every instance. Utilizing this unified diagnostic framework, we systematically evaluate representative LLM families across varying parameter scales and training paradigms. Our empirical results reveal structural failure patterns in STEM reasoning. By integrating multi-disciplinary coverage and fine-grained cognitive stratification into a unified framework, STEMVerse provides a clear and actionable perspective for understanding the scientific reasoning characteristics of LLMs.

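Motivation and objectives

Address the current state of STEM evaluation, which is fragmented into siloed benchmarks and collapses into single aggregate scores.
Introduce a dual-axis capability matrix that links academic sub-disciplines with Bloom's taxonomy.
Re-organize 20,374 STEM problems from multiple benchmarks into a unified discipline × cognition space.
Evaluate open-source LLM families across scales to identify structural cognitive bottlenecks and non-linear growth patterns.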

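Proposed method

Cross-benchmark data re-aggregation into four STEM pillars (mathematics, physics, chemistry, biology), each divided into fine-grained sub-disciplines.
Annotation of every problem along two axes: academic specialization and Bloom's cognitive level.
Hybrid annotation pipeline: GPT-4o performs the labeling, and experts manually audit the labels to ensure reliability (inter-annotator agreement, IAA, of 0.87–0.92).
Construction of a dual-axis capability matrix that maps model accuracy onto disciplines and cognitive levels.
Few-shot prompting protocol at evaluation time to ensure cross-model comparability; accuracy is adopted as the local diagnostic metric within each matrix cell.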
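The capability matrix described above can be illustrated with a minimal sketch. It assumes each evaluated problem is a record carrying a discipline label, a Bloom level label, and a correctness flag; the field names, the helper functions, and the toy records are hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict


def capability_matrix(records):
    """Return {(discipline, level): accuracy} from per-problem results.

    `records` is an iterable of dicts with hypothetical keys
    "discipline", "level", and "correct".
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        cell = (r["discipline"], r["level"])
        totals[cell] += 1
        hits[cell] += int(r["correct"])
    return {cell: hits[cell] / totals[cell] for cell in totals}


def aggregate_accuracy(records):
    """Single monolithic score over all problems, for comparison."""
    records = list(records)
    return sum(int(r["correct"]) for r in records) / len(records)


# Toy data only (not results from the paper).
records = [
    {"discipline": "Physics", "level": "Remember", "correct": True},
    {"discipline": "Physics", "level": "Remember", "correct": True},
    {"discipline": "Physics", "level": "Apply", "correct": False},
    {"discipline": "Physics", "level": "Apply", "correct": False},
    {"discipline": "Chemistry", "level": "Understand", "correct": True},
    {"discipline": "Chemistry", "level": "Apply", "correct": False},
]

matrix = capability_matrix(records)
overall = aggregate_accuracy(records)

for (discipline, level), acc in sorted(matrix.items()):
    print(f"{discipline:10s} {level:11s} {acc:.2f}")
print(f"aggregate  {overall:.2f}")
```

In this toy data the aggregate score sits in the middle while individual cells are at the extremes, which is the kind of gap a per-cell accuracy view is meant to expose.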

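Experimental results

Research questions

RQ1: How do LLMs perform across fine-grained academic sub-disciplines along Bloom's cognitive levels?
RQ2: Do conventional single-score benchmarks hide knowledge and reasoning deficiencies in STEM reasoning?
RQ3: How do scaling and training effects on STEM reasoning play out along the discipline × cognition spectrum?
RQ4: Do structural bottlenecks in higher-order STEM reasoning (e.g., logic-symbol collapse) appear across model families?
RQ5: How do open-source models of different sizes and training paradigms distribute their capabilities in the STEMVerse space?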

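Key findings

The dual-axis view reveals a non-linear evolution of capabilities; aggregate scores can mask discipline-specific and cognition-specific gaps.
Discipline-level results show a knowledge-silo pattern: no model under 14B exceeds 38.0% in Physical Chemistry, while Qwen3-14B-Instruct reaches 32.5% in Analytical Chemistry and 58.3% in Neuroscience and Psychology.
Cognitive-level results peak at the Understand level and drop at Apply in Biology, Physics, and Chemistry; logic-symbol collapse is pronounced in symbol-heavy disciplines when moving to higher-order tasks.
Parameter scaling yields non-linear gains: the Remember tier rises by about +10% per step for Qwen3, whereas Understand shows a threshold effect (e.g., rising from about 60% to about 90% between 8B and 14B).
Instruction tuning shortens complex reasoning paths and improves controllability, but it may degrade higher-order symbolic reasoning in mathematics sub-disciplines.
The framework exposes structural deficiencies in training paradigms for higher-order scientific reasoning and highlights non-linear growth patterns across disciplines and scales.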
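One way to make the contrast between steady gains and threshold effects concrete is to compute per-level accuracy differences between consecutive model sizes. The sketch below is an assumption-laden illustration: the function names, the size labels, and the `min_jump` cutoff are hypothetical, and the numbers are placeholders that only loosely echo the approximate figures quoted above, not data reported in the paper.

```python
def level_gains(acc_by_size):
    """Per-level accuracy gain between consecutive model sizes.

    `acc_by_size` is a list of (size_label, {level: accuracy}) pairs
    ordered from smallest to largest model.
    """
    gains = {}
    for (small, a), (large, b) in zip(acc_by_size, acc_by_size[1:]):
        gains[(small, large)] = {
            lvl: round(b[lvl] - a[lvl], 4) for lvl in a if lvl in b
        }
    return gains


def threshold_jumps(gains, min_jump=0.2):
    """Return (step, level) pairs whose gain is at least `min_jump`.

    The 0.2 cutoff is an arbitrary illustrative choice.
    """
    return [
        (step, lvl)
        for step, per_level in gains.items()
        for lvl, gain in per_level.items()
        if gain >= min_jump
    ]


# Placeholder accuracies for two hypothetical model sizes.
acc_by_size = [
    ("model-small", {"Remember": 0.5, "Understand": 0.6}),
    ("model-large", {"Remember": 0.6, "Understand": 0.9}),
]

gains = level_gains(acc_by_size)
jumps = threshold_jumps(gains)
print(gains)
print(jumps)
```

With these placeholder values, only the Understand level crosses the cutoff, so it is the only one flagged as a threshold-like jump.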


This review was generated by AI and checked by a human editor.