QUICK REVIEW

[Paper Review] Statsformer: Validated Ensemble Learning with LLM-Derived Semantic Priors

Erica Zhang, Naomi Sagan|arXiv (Cornell University)|Jan 29, 2026
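Topic Modeling | Cited by 0
One-Sentence Summary

Statsformer integrates semantic priors derived from large language models into an ensemble of linear and nonlinear learners and prunes them via cross-validation. It provides an oracle-style guarantee relative to convex combinations of the base learners and performs robustly whether the prior is informative or noisy.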

ABSTRACT

We introduce Statsformer, a principled framework for integrating large language model (LLM)-derived knowledge into supervised statistical learning. Existing approaches are limited in adaptability and scope: they either inject LLM guidance as an unvalidated heuristic, which is sensitive to LLM hallucination, or embed semantic information within a single fixed learner. Statsformer overcomes both limitations through a guardrailed ensemble architecture. We embed LLM-derived feature priors within an ensemble of linear and nonlinear learners, adaptively calibrating their influence via cross-validation. This design yields a flexible system with an oracle-style guarantee that it performs no worse than any convex combination of its in-library base learners, up to statistical error. Empirically, informative priors yield consistent performance improvements, while uninformative or misspecified LLM guidance is automatically downweighted, mitigating the impact of hallucinations across a diverse range of prediction tasks. An open-source implementation of Statsformer is available at https://github.com/pilancilab/statsformer.

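Motivation and Objectives

  • Advance the integration of LLM-derived semantic priors into supervised learning in a principled, data-validated way.
  • Develop a model-agnostic framework that injects LLM priors into different base learners through monotone adapters.
  • Provide a theoretical guarantee that the ensemble competes with any convex combination of the base learners, up to statistical error.
  • Demonstrate practical effectiveness and robustness on high-dimensional, small-sample tabular datasets.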

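Proposed Method

  • Define the LLM-derived prior V as a nonnegative vector of scores over the features.
  • Inject the prior through a monotone map tau_alpha, with alpha ranging over a finite set, applied to each feature's prior score (via a weight, scale, or instance-weight adapter).
  • Build a dictionary of prior-injected base learners that includes the prior-free baseline (alpha=0, beta=0).
  • Run out-of-fold (OOF) stacking over all base-learner configurations to obtain data-driven aggregation weights pi, constrained to the simplex.
  • Refit the selected configurations on the full data and combine them convexly with the OOF-derived weights to form the final Statsformer predictor.
  • Provide three concrete prior-injection implementations: penalty-based, feature reweighting, and instance-weight injection.
  • Give a theoretical oracle guarantee that links cross-validation risk to population risk and is robust to misspecified priors.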
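
The pipeline above (monotone adapter, dictionary of prior-injected learners, OOF stacking on the simplex, refit and convex combination) can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption made for illustration and is not taken from the paper or its repository: the power-law form of `tau`, the k-NN regressor used as a stand-in base learner, the exponentiated-gradient solver for the simplex weights, the alpha grid, and the synthetic data.

```python
import math
import random

def tau(v, alpha):
    """Hypothetical monotone adapter: prior scores -> per-feature scales.
    alpha = 0 gives all ones, i.e. the prior-free baseline."""
    vmax = max(v)
    return [(vj / vmax) ** alpha for vj in v]

def knn_predict(X_tr, y_tr, X_te, scale, k=5):
    """k-NN regression on rescaled features (stand-in base learner)."""
    preds = []
    for x in X_te:
        dists = sorted(
            (sum((s * (a - b)) ** 2 for s, a, b in zip(scale, x, xt)), yt)
            for xt, yt in zip(X_tr, y_tr)
        )
        preds.append(sum(yt for _, yt in dists[:k]) / k)
    return preds

def oof_predictions(X, y, scale, n_folds=5):
    """Out-of-fold predictions: each sample is predicted by a model
    that never saw it during fitting."""
    n = len(X)
    oof = [0.0] * n
    for f in range(n_folds):
        te = [i for i in range(n) if i % n_folds == f]
        tr = [i for i in range(n) if i % n_folds != f]
        preds = knn_predict([X[i] for i in tr], [y[i] for i in tr],
                            [X[i] for i in te], scale)
        for i, p in zip(te, preds):
            oof[i] = p
    return oof

def simplex_weights(oof_by_config, y, eta=0.1, n_iter=500):
    """Minimize OOF squared error over the simplex with exponentiated
    gradient steps (an assumed solver; weights stay >= 0 and sum to 1)."""
    K, n = len(oof_by_config), len(y)
    pi = [1.0 / K] * K
    for _ in range(n_iter):
        blend = [sum(pi[k] * oof_by_config[k][i] for k in range(K))
                 for i in range(n)]
        grad = [2.0 / n * sum((blend[i] - y[i]) * oof_by_config[k][i]
                              for i in range(n)) for k in range(K)]
        pi = [w * math.exp(-eta * g) for w, g in zip(pi, grad)]
        total = sum(pi)
        pi = [w / total for w in pi]
    return pi

# Synthetic data: only the first two features carry signal.
rng = random.Random(0)
n, p = 60, 6
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2 * x[0] - x[1] + 0.1 * rng.gauss(0, 1) for x in X]

V = [1.0, 1.0, 0.1, 0.1, 0.1, 0.1]   # made-up "LLM" prior scores
alphas = [0.0, 0.5, 1.0]             # alpha = 0 is the prior-free config

oof_by_config = [oof_predictions(X, y, tau(V, a)) for a in alphas]
pi = simplex_weights(oof_by_config, y)

def predict(X_new):
    """Refit each config on the full data, then combine convexly with pi."""
    per_config = [knn_predict(X, y, X_new, tau(V, a)) for a in alphas]
    return [sum(w * preds[i] for w, preds in zip(pi, per_config))
            for i in range(len(X_new))]

print(dict(zip(alphas, [round(w, 3) for w in pi])))
```

Because the prior-free configuration stays in the dictionary, a misleading prior can in principle be handled by shifting weight toward alpha = 0. This is the mechanism the summary describes as downweighting unreliable guidance.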
Figure 1 : Statsformer performance on a variety of datasets, compared to a variety of baseline methods. Note that, due to computational constraints, we only included the AutoML-Agent baseline in Bank Marketing, ETP, and Lung Cancer (see Table 3 in the Appendix for a more detailed computational compa

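Experimental Results

Research Questions

  • RQ1: Can LLM-derived semantic priors be incorporated into supervised learning in a principled, validated way?
  • RQ2: How should the strength and form of the prior be calibrated to maximize predictive performance while guarding against hallucinations?
  • RQ3: Can an aggregated ensemble with validated priors compete with the best convex combination of base learners, up to statistical error?
  • RQ4: Is the approach scalable, model-agnostic, and robust across diverse high-dimensional tabular datasets?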

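Key Findings

  • Across a range of tabular datasets, Statsformer improves consistently over prior-free stacking, especially in high-dimensional, small-sample settings.
  • When the prior is unreliable or uninformative, the framework downweights it and degrades gracefully to the prior-free baseline.
  • The oracle-style guarantee shows that the aggregated predictor matches the best convex combination of the candidate learners up to a statistical error term.
  • The gains hold across multiple datasets and LLM choices, with larger, more capable LLMs yielding more pronounced improvements.
  • Adversarial simulations demonstrate robustness: when the prior is systematically inverted, performance approaches that of baseline stacking.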
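
The oracle-style guarantee mentioned above can be written schematically as follows. The notation is an assumption introduced here, not the paper's: $R$ denotes population risk, $\hat f_1, \dots, \hat f_K$ the base learners in the dictionary, $\hat\pi$ the OOF-derived weights, $\Delta^{K-1}$ the probability simplex, and $\varepsilon_n$ a statistical error term whose exact form and conditions are given in the paper and are not reproduced here.

```latex
R\Big(\sum_{k=1}^{K} \hat\pi_k \hat f_k\Big)
  \;\le\;
  \min_{\pi \in \Delta^{K-1}} R\Big(\sum_{k=1}^{K} \pi_k \hat f_k\Big)
  \;+\; \varepsilon_n
```

Since the prior-free configuration is one of the $\hat f_k$, placing all weight on it is a feasible choice of $\pi$, so a bound of this shape implies the ensemble is no worse than the prior-free baseline up to $\varepsilon_n$.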
Figure 2 : Direct accuracy and AUROC comparison of Statsformer to Statsformer (no prior) for selected datasets. Gains are noticeable across all four examples, and significant for ETP. See Figure 11 in the Appendix for datasets not shown here.


This review was generated by AI and reviewed by human editors.