QUICK REVIEW

[Paper Review] The Magic Correlations: Understanding Knowledge Transfer from Pretraining to Supervised Fine-Tuning

Simin Fan, Dimitris Paparas|arXiv (Cornell University)|Feb 11, 2026

Topic Modeling | Cited by 0

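One-Sentence Summary

This paper analyzes how capabilities learned during pretraining transfer through supervised fine-tuning (SFT). It applies correlation protocols across data mixtures, model scales, and benchmarks to show when transfer is reliable and how calibration evolves.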

ABSTRACT

Understanding how language model capabilities transfer from pretraining to supervised fine-tuning (SFT) is fundamental to efficient model development and data curation. In this work, we investigate four core questions: RQ1. To what extent do accuracy and confidence rankings established during pretraining persist after SFT? RQ2. Which benchmarks serve as robust cross-stage predictors and which are unreliable? RQ3. How do transfer dynamics shift with model scale? RQ4. How well does model confidence align with accuracy, as a measure of calibration quality? Does this alignment pattern transfer across training stages? We address these questions through a suite of correlation protocols applied to accuracy and confidence metrics across diverse data mixtures and model scales. Our experiments reveal that transfer reliability varies dramatically across capability categories, benchmarks, and scales -- with accuracy and confidence exhibiting distinct, sometimes opposing, scaling dynamics. These findings shed light on the complex interplay between pretraining decisions and downstream outcomes, providing actionable guidance for benchmark selection, data curation, and efficient model development.

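Motivation and Objectives

Assess whether the accuracy and confidence rankings established during pretraining persist after supervised fine-tuning (SFT).
Identify which benchmarks reliably predict post-SFT performance and which do not.
Characterize how transfer dynamics change as model scale increases, across different data mixtures.
Examine how well model confidence aligns with accuracy (calibration), and whether this alignment persists across training stages.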

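Proposed Method

Train decoder-only Transformer models at two scales (240M and 1B parameters).
Create 9 diverse pretraining data mixtures by combining web, code, and curated sources in different proportions.
Fine-tune the pretrained checkpoints for 5 epochs on a single SFT dataset (Tulu-v2-mix).
Evaluate on 20 benchmarks spanning four capability categories (commonsense, science, NLI, semantics).
Compute cross-stage accuracy and confidence correlations across the data mixtures to assess transfer reliability.
Analyze within-category consistency, cross-stage calibration, and the effect of model scale on transfer patterns.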
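
The cross-stage correlation step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes that each benchmark has one pretraining (PT) score and one SFT score per data mixture, and that the transfer measure is the Pearson correlation of those two score lists, averaged per capability category. The function names, benchmark names, category name, and all numbers are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def cross_stage_correlation(pt_scores, sft_scores):
    """Per-benchmark correlation between PT and SFT scores across mixtures.

    Both arguments map a benchmark name to a list of scores, one per data
    mixture, with the mixtures listed in the same order in both dicts.
    """
    return {b: pearson(pt_scores[b], sft_scores[b]) for b in pt_scores}


def category_mean(per_benchmark, categories):
    """Average the per-benchmark correlations within each category."""
    return {
        cat: mean([per_benchmark[b] for b in names])
        for cat, names in categories.items()
    }


# Made-up illustrative scores (NOT from the paper): 4 mixtures, 2 benchmarks.
pt = {"bench_a": [0.30, 0.35, 0.40, 0.45], "bench_b": [0.50, 0.52, 0.51, 0.55]}
sft = {"bench_a": [0.32, 0.36, 0.43, 0.47], "bench_b": [0.56, 0.50, 0.54, 0.53]}

r = cross_stage_correlation(pt, sft)
print(r)
print(category_mean(r, {"hypothetical_category": ["bench_a", "bench_b"]}))
```

In this toy data, `bench_a` keeps its mixture ordering after SFT (correlation close to 1), while `bench_b` does not (negative correlation). That is the kind of contrast the protocol is meant to surface. The same functions apply to confidence scores in place of accuracy.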

Figure 1: Cross-stage correlation by capability category. (a) Accuracy correlation: the 1B model generally shows higher transferability; (b) Confidence correlation: 240M maintains substantially higher correlation, especially in the Commonsense (0.87 vs. 0.40) and Science (0.82 vs. 0.49) domains.

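Experimental Results

Research Questions

RQ1: To what extent do the accuracy and confidence rankings established during pretraining persist after SFT?
RQ2: Which benchmarks serve as reliable cross-stage early predictors, and which do not?
RQ3: How do transfer dynamics change with model scale?
RQ4: How well does model confidence align with accuracy, and does this calibration pattern persist across training stages?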

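Key Findings

Cross-stage accuracy correlation is generally higher at the larger scale (1B) than at 240M.
Confidence transfers more strongly at the smaller scale (240M) and more weakly at the larger scale (1B), with patterns that differ by category.
Commonsense and science benchmarks show higher cross-stage accuracy correlation, whereas NLI and semantics benchmarks transfer more weakly.
At 240M, confidence patterns on commonsense and science tasks are highly persistent, with mean cross-stage confidence correlations of about 0.87 and 0.82, respectively (versus 0.40 and 0.49 at 1B).
Within-category consistency changes with scale: smaller models show competition among benchmarks within a category, while larger models show synergy, especially on science tasks.
On science tasks, confidence is closely aligned with accuracy (r_align ≈ 0.8), whereas commonsense and semantics tasks show miscalibration that persists through SFT.
Education-filtered data (FineWeb-Edu) has a scale-dependent effect on accuracy and calibration: it improves some tasks at 240M but sometimes degrades them at 1B.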
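
The alignment and within-category findings above can be illustrated with a small sketch. The summary does not give the paper's exact definitions, so two assumptions are made here: `r_align` is taken to be the Pearson correlation between confidence and accuracy across data mixtures, and within-category consistency is taken to be the mean pairwise correlation between benchmarks of the same category. All names and numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    if norm == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / norm


def alignment(accuracy, confidence):
    """Assumed r_align: correlation of confidence with accuracy across mixtures."""
    return pearson(accuracy, confidence)


def within_category_consistency(scores):
    """Assumed consistency measure: mean pairwise correlation in one category.

    `scores` maps benchmark name -> per-mixture scores. A positive value means
    the benchmarks tend to improve together across mixtures (synergy); a
    negative value means gains on one come with losses on another (competition).
    """
    pairs = list(combinations(sorted(scores), 2))
    if not pairs:
        raise ValueError("need at least two benchmarks in the category")
    return mean([pearson(scores[a], scores[b]) for a, b in pairs])


# Made-up illustrative values (NOT from the paper), one entry per mixture.
acc = [0.40, 0.45, 0.50, 0.55]
conf = [0.60, 0.62, 0.70, 0.71]
print(alignment(acc, conf))

competing = {"task_a": [1, 2, 3, 4], "task_b": [4, 3, 2, 1]}
synergistic = {"task_a": [1, 2, 3, 4], "task_b": [2, 4, 6, 8], "task_c": [1, 3, 5, 7]}
print(within_category_consistency(competing))
print(within_category_consistency(synergistic))
```

Computing `alignment` separately on PT and SFT scores would show whether a calibration pattern carries over between stages. A high value at both stages would correspond to the persistent alignment reported for science tasks.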

Figure 2: Cross-stage correlation across benchmarks. Each bar shows the Pearson correlation between PT and SFT performance on a given benchmark across data mixtures. (a) Accuracy correlation: the 1B model achieves higher transferability than 240M (on average $\bar{r} = 0.59$

This review was generated by AI and reviewed by human editors.