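[Paper Digest] SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning
SED-SFT introduces selective entropy regularization with a masking mechanism, encouraging diversity only on tokens that have sufficient exploration space, which improves RL results with minimal overhead.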
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). However, the conventional SFT process, driven by Cross-Entropy (CE) loss, often induces mode collapse, where models over-concentrate on specific response patterns. This lack of distributional diversity severely restricts the exploration efficiency required for subsequent RL. While recent studies have attempted to improve SFT by replacing the CE loss, aiming to preserve diversity or refine the update policy, they fail to adequately balance diversity and accuracy, thereby yielding suboptimal performance after RL. To address the mode collapse problem, we propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective. Extensive experiments across eight mathematical benchmarks demonstrate that SED-SFT significantly enhances generation diversity with a negligible computational overhead increase compared with CE loss, yielding average improvements of 2.06 and 1.20 points in subsequent RL performance over standard CE-based baselines on Llama-3.2-3B-Instruct and Qwen2.5-Math-7B-Instruct, respectively. The code is publicly available at https://github.com/pppa2019/SED-SFT
Research Motivation and Goals
- Motivate the need to mitigate mode collapse in standard SFT driven by Cross-Entropy loss.
- Identify token-level exploration space as a key factor limiting diversity and downstream RL gains.
- Propose SED-SFT to selectively regularize prediction probabilities based on token exploration space.
- Demonstrate that SED-SFT yields diversity gains and better RL performance on math benchmarks with two backbones.
Proposed Method
- Introduce a selective masking mechanism M_t to decide where to apply diversity encouragement.
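- Define P_Top-k(t) as the cumulative probability of the top-k tokens at position t and set M_t = 1 if P_Top-k(t) < tau, otherwise M_t = 0. A low P_Top-k(t) means the probability mass is spread over many tokens, so the position has a large exploration space.
- Use a quadratic diversity-encouraging penalty L_DE(p) = (p - 0.5)^2 on the ground-truth token probability p, applied only at positions where M_t = 1.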
- Combine CE loss with the masked diversity penalty: L_SED-SFT = sum_t [-log pi_theta(y_t^* | x, y_<t) + lambda * M_t * L_DE(pi_theta(y_t^* | x, y_<t))].
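- Tune tau via the (1-r)-quantile of observed P_Top-k across samples, with r as the masking ratio; under the mask definition above, roughly a fraction r of tokens (those with the highest top-k mass) are excluded from the penalty.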
- Set lambda = 1 in all experiments to balance diversity and accuracy.
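The objective above can be sketched in a few lines of plain Python. This is a minimal illustration built only from the formulas listed in this digest, not the authors' implementation (see the linked repository for that). The function names `top_k_mass`, `quantile_threshold`, and `sed_sft_loss` are hypothetical, and the nearest-rank quantile convention is an assumption, since the digest does not say how the quantile is computed.

```python
import math


def top_k_mass(probs, k):
    """P_Top-k(t): cumulative probability of the k most likely tokens."""
    return sum(sorted(probs, reverse=True)[:k])


def quantile_threshold(masses, r):
    """tau as the (1 - r)-quantile of observed top-k masses.

    Assumption: nearest-rank quantile; the source does not specify one.
    """
    ordered = sorted(masses)
    rank = math.ceil((1 - r) * len(ordered))
    index = min(len(ordered) - 1, max(0, rank - 1))
    return ordered[index]


def sed_sft_loss(prob_rows, targets, k, tau, lam=1.0):
    """Sum over positions of CE plus the masked quadratic penalty.

    prob_rows: one probability distribution (list of floats) per position.
    targets:   index of the ground-truth token at each position.
    """
    total = 0.0
    for probs, target in zip(prob_rows, targets):
        p = probs[target]  # probability of the ground-truth token
        # M_t = 1 only where the top-k mass is below tau
        mask = 1.0 if top_k_mass(probs, k) < tau else 0.0
        total += -math.log(p) + lam * mask * (p - 0.5) ** 2
    return total


# Toy example: one peaked position and one flat position.
rows = [[0.97, 0.01, 0.01, 0.01], [0.4, 0.3, 0.2, 0.1]]
loss = sed_sft_loss(rows, targets=[0, 0], k=2, tau=0.9)
```

In the toy example, only the second (flat) position has top-2 mass below tau, so only it receives the penalty; the peaked position is trained with plain CE. A real training loop would compute the same quantities on batched logits with an autodiff framework.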
Experimental Results
Research Questions
- RQ1: How does token-level exploration space affect diversity during SFT?
- RQ2: Can selective entropy regularization improve downstream RL performance after SFT-then-RL pipelines?
- RQ3: What is the impact of masking tokens with low exploration space on accuracy and diversity in mathematical reasoning tasks?
- RQ4: Do SED-SFT gains generalize across different backbones and eight mathematical benchmarks?
Key Findings
- SED-SFT consistently improves downstream RL performance over CE-based baselines on two backbones: average improvements of 2.06 points (Llama-3.2-3B-Instruct) and 1.20 points (Qwen2.5-Math-7B-Instruct).
- SED-SFT achieves higher generation diversity as shown by lower Self-BLEU scores compared to CE and DFT baselines.
- A masking strategy that suppresses diversity encouragement on tokens with low exploration space is crucial for maintaining accuracy while increasing diversity.
- DFT can boost SFT performance but severely restricts exploration space, limiting RL gains; GEM increases diversity but ignores token-specific exploration space.
- A hyperparameter sensitivity analysis indicates robustness: SED-SFT outperforms CE when the masking ratio r > 0.5 and k > 1 for the top-k exploration space.
- Sentence-level diversity (Self-BLEU) improves under SED-SFT and GEM relative to CE and DFT.
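Self-BLEU, the diversity metric cited in the findings above, scores each generated sample with BLEU against the other samples and averages the results, so lower values mean more diverse outputs. The sketch below is a simplified illustration and not the paper's evaluation code: it assumes whitespace tokenization, n-grams up to order 2, and no smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hyp, refs, max_n=2):
    """Unsmoothed BLEU of one hypothesis against several references."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        if not hyp_counts:
            return 0.0
        # Clip each n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        if clipped == 0:
            return 0.0
        log_precision += math.log(clipped / sum(hyp_counts.values())) / max_n
    # Brevity penalty against the reference length closest to the hypothesis.
    ref_len = min((len(r) for r in refs), key=lambda n: (abs(n - len(hyp)), n))
    penalty = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / len(hyp))
    return penalty * math.exp(log_precision)


def self_bleu(samples, max_n=2):
    """Average BLEU of each sample against all the other samples."""
    tokenized = [s.split() for s in samples]
    scores = [
        bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)


collapsed = self_bleu(["x equals two", "x equals two", "x equals two"])
diverse = self_bleu(["x equals two", "solve for y first", "the answer is 2"])
```

Identical samples give the maximum Self-BLEU, while samples sharing no n-grams give zero, which matches the reading that mode collapse shows up as a high score.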
This digest was generated by AI and reviewed by human editors.