QUICK REVIEW

[Paper Digest] STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models

Xuzhao Li, Xuchen Li|arXiv (Cornell University)|Jan 14, 2026

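One-sentence summary

STEMVerse reframes STEM evaluation of LLMs by mapping each problem into a two-dimensional space of disciplinary specialization and Bloom cognitive level, which allows a finer-grained diagnosis of reasoning ability than a single accuracy score.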

ABSTRACT

As Large Language Models (LLMs) achieve significant breakthroughs in complex reasoning tasks, evaluating their proficiency in science, technology, engineering, and mathematics (STEM) has become a primary method for measuring machine intelligence. However, current evaluation paradigms often treat benchmarks as isolated "silos," offering only monolithic aggregate scores that neglect the intricacies of both academic specialization and cognitive depth. This result-oriented approach fails to distinguish whether model errors stem from insufficient domain knowledge or deficiencies in cognitive capacity, thereby limiting the diagnostic value. To address this, we propose STEMVerse, a diagnostic framework designed to systematically analyze the STEM reasoning capabilities of LLMs. This framework characterizes model performance across academic specialization and cognitive complexity to map the capability required for reasoning. We re-aggregate over 20,000 STEM problems from mainstream benchmarks into a unified "Discipline $\times$ Cognition" capability space, assigning dual-axis labels to every instance. Utilizing this unified diagnostic framework, we systematically evaluate representative LLM families across varying parameter scales and training paradigms. Our empirical results reveal structural failure patterns in STEM reasoning. By integrating multi-disciplinary coverage and fine-grained cognitive stratification into a unified framework, STEMVerse provides a clear and actionable perspective for understanding the scientific reasoning characteristics of LLMs.

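Research motivation and goals

Address the fragmentation of STEM evaluation, which is currently split across isolated benchmarks that each report a single score.
Introduce a dual-axis capability matrix that couples disciplinary subfields with Bloom's taxonomy.
Re-aggregate 20,374 STEM problems from multiple benchmarks into a unified discipline × cognition space.
Evaluate open-source LLM families of different scales to identify structural cognitive bottlenecks and non-linear growth patterns.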

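Proposed method

Re-aggregate cross-benchmark data into four STEM pillars (mathematics, physics, chemistry, biology), each subdivided into sub-disciplines.
Label every problem on two axes: disciplinary specialization and Bloom cognitive level.
Hybrid annotation pipeline: GPT-4o produces the labels and human experts review them to ensure reliability (inter-annotator agreement, IAA, of 0.87–0.92).
Build a dual-axis capability matrix that maps each model's accuracy onto discipline and cognitive level.
Use a few-shot prompting protocol during evaluation to keep models comparable; accuracy within each matrix cell serves as the local diagnostic metric.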
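The capability matrix described above can be illustrated with a minimal sketch. It is not the authors' code: the record layout `(discipline, bloom_level, is_correct)`, the function names, and the toy records are all assumptions made for illustration. It computes per-cell accuracy next to the single aggregate score that the framework argues is too coarse.

```python
from collections import defaultdict


def capability_matrix(records):
    """Return {(discipline, bloom_level): accuracy} for graded records.

    Each record is a hypothetical (discipline, bloom_level, is_correct) tuple.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for discipline, level, is_correct in records:
        totals[(discipline, level)] += 1
        hits[(discipline, level)] += int(is_correct)
    return {cell: hits[cell] / totals[cell] for cell in totals}


def overall_accuracy(records):
    """Single aggregate score over all records, ignoring both axes."""
    return sum(int(r[2]) for r in records) / len(records)


# Toy, invented records (not data from the paper).
records = [
    ("Physics", "Remember", True),
    ("Physics", "Remember", False),
    ("Physics", "Apply", False),
    ("Chemistry", "Understand", True),
]

matrix = capability_matrix(records)
aggregate = overall_accuracy(records)

for cell in sorted(matrix):
    print(cell, matrix[cell])
print("aggregate", aggregate)
```

In this toy case the aggregate score is 0.5, while the individual cells range from 0.0 to 1.0, which is the kind of spread a single number hides.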

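Experimental results

Research questions

RQ1: How do LLMs perform when fine-grained disciplinary subfields are evaluated by Bloom cognitive level?
RQ2: Do traditional single-score benchmarks mask the difference between knowledge deficits and reasoning deficits in STEM reasoning?
RQ3: How do scale and training affect STEM reasoning across the discipline × cognition spectrum?
RQ4: Are there structural bottlenecks in higher-order STEM reasoning (such as logic-symbolic collapse) that recur across model families?
RQ5: How do open-source models of different scales and training paradigms distribute their capabilities across the STEMVerse space?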

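Key findings

The dual-axis view reveals a non-linear evolution of capability; aggregate scores mask discipline-specific and cognition-specific gaps.
Discipline-level results show that knowledge sits in isolated pockets: no model smaller than 14B exceeds 38.0% in physical chemistry, while Qwen3-14B-Instruct reaches 32.5% in analytical chemistry and 58.3% in neuroscience and psychology.
Cognitive results peak at the Understand level but drop at the Apply level in Biology, Physics, and Chemistry, and symbol-heavy domains show a marked logic-symbolic collapse as tasks move to higher-order levels.
Parameter scaling yields non-linear gains: at the Remember level, Qwen3 improves by roughly +10% per step in model size, whereas the Understand level shows a threshold effect (for example, from about 60% to about 90% when moving from 8B to 14B).
Instruction tuning can reduce complex reasoning paths and improve controllability, but may weaken higher-order symbolic reasoning in mathematical sub-disciplines.
The framework exposes structural deficiencies in current training paradigms for higher-order scientific reasoning and highlights non-linear growth patterns across disciplines and scales.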
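One way to separate steady per-step gains from a threshold effect is to compare accuracy between consecutive model sizes and flag unusually large jumps. The sketch below is only an illustration under stated assumptions: the function names and the `min_jump` cutoff are hypothetical, the 8B and 14B values echo the rough ~60% and ~90% figures quoted above, and the 4B value is an invented placeholder, not a reported result.

```python
def scale_gains(acc_by_size):
    """Accuracy change between consecutive model sizes.

    acc_by_size: list of (params_in_billions, accuracy) pairs.
    Returns a list of (smaller_size, larger_size, gain) tuples.
    """
    ordered = sorted(acc_by_size)
    return [
        (small, large, round(acc_large - acc_small, 4))
        for (small, acc_small), (large, acc_large) in zip(ordered, ordered[1:])
    ]


def threshold_jumps(acc_by_size, min_jump=0.2):
    """Keep only the steps whose gain is at least min_jump (assumed cutoff)."""
    return [step for step in scale_gains(acc_by_size) if step[2] >= min_jump]


# Understand-level accuracy by model size.
# 4B is an invented placeholder; 8B and 14B are the approximate values
# quoted in the findings above.
understand = [(4, 0.55), (8, 0.60), (14, 0.90)]

gains = scale_gains(understand)
jumps = threshold_jumps(understand)
print(gains)
print(jumps)
```

With these inputs only the 8B to 14B step clears the cutoff, which matches the threshold-style behaviour described for the Understand level; a level that improves by a similar amount at every step would produce no flagged jumps.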

This digest was generated by AI and reviewed by human editors.