[Paper Review] ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution
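ReMiT uses an RL-tuned reference model to dynamically reweight tokens during mid-training. This improves the base model and sustains the gains through post-training, creating a self-reinforcing flywheel between pre-training and post-training.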
Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process, in which insights from post-training retroactively improve the pre-trained foundation, remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which a reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, requiring no specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training, utilizing high-quality corpora under a rapidly decaying learning rate. Building upon this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks, spanning math, code, and general reasoning, and sustains these gains by over 2% throughout the post-training pipeline. These results validate an iterative feedback loop, enabling continuous and self-reinforcing evolution of LLMs.
Motivation and Objectives
- Identify mid-training as a critical turning point for LLM capabilities.
- Propose a token-level dynamic reweighting mechanism guided by an RL-tuned reference model.
- Enable bidirectional influence between post-training and pre-training without external teachers.
- Demonstrate that mid-training improvements transfer and amplify during post-training across model families.
Proposed Method
- Introduce ReMiT, a token-level reweighting scheme that uses the RL-tuned model as a reference during mid-training.
- Compute per-token loss discrepancy between base and RL reference, center the delta loss per sequence, and map it to weights with a clipped scaled sigmoid.
- Integrate the weights into the mid-training objective as a soft reweighting of the standard next-token prediction loss.
- Use the in-pipeline RL-tuned model as the reference to avoid external teachers.
- Provide theoretical justification linking ReMiT to KL-divergence toward an implicit target distribution and to KL-regularized RL.
- Experiment with three open-source base-model families (OLMo-1B, SmolLM3-3B, Youtu-LLM-2B), comparing ReMiT against baselines across 10 downstream benchmarks.
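The reweighting steps listed under the proposed method (per-token loss discrepancy, per-sequence centering, clipped scaled sigmoid, soft reweighting of the next-token loss) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names `remit_token_weights` and `reweighted_ntp_loss`, the parameters `scale`, `w_min`, and `w_max`, the sign convention (base loss minus reference loss), and the `2 * sigmoid` form are all assumptions made for this example; the review does not state the exact formula or hyperparameters.

```python
import math


def remit_token_weights(base_losses, ref_losses, scale=1.0, w_min=0.5, w_max=1.5):
    """Map per-token loss gaps of one sequence to clipped weights.

    base_losses / ref_losses: per-token next-token losses of the base model
    and of the RL-tuned reference on the same sequence (assumed inputs).
    scale, w_min, w_max: hypothetical hyperparameters for this sketch.
    """
    # Loss discrepancy per token. Assumed sign: positive where the RL
    # reference predicts the token better than the base model does.
    deltas = [b - r for b, r in zip(base_losses, ref_losses)]

    # Center the discrepancy within the sequence.
    mean_delta = sum(deltas) / len(deltas)
    centered = [d - mean_delta for d in deltas]

    weights = []
    for c in centered:
        # Scaled sigmoid: 2 * sigmoid(scale * c) equals 1.0 at c == 0,
        # so a token with an average gap keeps its original weight.
        w = 2.0 / (1.0 + math.exp(-scale * c))
        # Clip so that no token is dropped entirely or dominates the loss.
        weights.append(min(max(w, w_min), w_max))
    return weights


def reweighted_ntp_loss(base_losses, weights):
    """Soft-reweighted next-token loss: mean of weight * per-token loss."""
    return sum(w * l for w, l in zip(weights, base_losses)) / len(base_losses)


if __name__ == "__main__":
    base = [2.0, 1.0, 3.0]   # illustrative per-token losses, not real data
    ref = [1.0, 1.0, 3.5]
    w = remit_token_weights(base, ref)
    print(w, reweighted_ntp_loss(base, w))
```

In an actual training loop the weights would presumably be treated as constants (no gradient flowing through them), with only the base model's per-token losses being differentiated; that detail is also an assumption here.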
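Experimental Results
Research Questions
- RQ1: Can mid-training reweighting guided by an RL reference improve the base model's capabilities?
- RQ2: Do mid-training gains transfer and persist through the post-training stages (SFT, DPO, RLVR)?
- RQ3: Does ReMiT offer advantages over knowledge distillation and token-level data filtering approaches?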
Key Findings
| Model family / Benchmark | Pre-Trained | Vanilla NTP | MiniPLM | RHO-1 | ReMiT | Avg. |
|---|---|---|---|---|---|---|
| OLMo-1B | 3.03 | 48.14 | 48.45 | 50.42 | 61.64 | 27.56 |
| MATH | 2.94 | 10.26 | 9.60 | 10.32 | 14.50 | 9.91 |
| GPQA | 20.31 | 22.54 | 23.21 | 25.45 | 24.55 | 23.02 |
| BBH | 28.43 | 30.87 | 30.38 | 29.33 | 32.07 | 30.22 |
| IFE | 22.66 | 16.19 | 16.79 | 19.06 | 28.54 | 20.84 |
| HE | 6.71 | 8.54 | 7.32 | 6.71 | 12.80 | 8.22 |
| MBPP | 4.80 | 4.60 | 6.80 | 6.20 | 9.20 | 6.28 |
| TQA | 21.30 | 22.40 | 23.13 | 23.38 | 25.58 | 23.48 |
| ARC-C | 44.71 | 46.67 | 45.31 | 46.42 | 49.23 | 46.07 |
| MMLU-P | 9.54 | 13.31 | 13.15 | 13.68 | 17.44 | 13.62 |
| Avg. | 16.44 | 22.35 | 22.41 | 23.10 | 27.56 | 22.58 |
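- ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks across model families.
- Mid-training gains transfer to post-training, sustaining an improvement of over 2% throughout the pipeline.
- ReMiT outperforms baselines such as Vanilla NTP, MiniPLM, and RHO-1 on downstream tasks.
- The method enables a joint flywheel between the base model and the RL model without an external teacher.
- Clipping the token weights emphasizes pivotal tokens while preserving training stability and data integrity.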
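As a quick sanity check on the table above, the per-method averages can be recomputed from the ten listed rows. The sketch below copies the scores exactly as they appear in the table, keeping the row labels as listed (including the first row, labeled "OLMo-1B"). The helper names `column_means` and `rows_where_not_best` are hypothetical and introduced only for this example. The "Avg." column on the right is not used.

```python
# Scores copied from the table above; columns follow METHODS in order.
METHODS = ["Pre-Trained", "Vanilla NTP", "MiniPLM", "RHO-1", "ReMiT"]
ROWS = {
    "OLMo-1B": [3.03, 48.14, 48.45, 50.42, 61.64],
    "MATH": [2.94, 10.26, 9.60, 10.32, 14.50],
    "GPQA": [20.31, 22.54, 23.21, 25.45, 24.55],
    "BBH": [28.43, 30.87, 30.38, 29.33, 32.07],
    "IFE": [22.66, 16.19, 16.79, 19.06, 28.54],
    "HE": [6.71, 8.54, 7.32, 6.71, 12.80],
    "MBPP": [4.80, 4.60, 6.80, 6.20, 9.20],
    "TQA": [21.30, 22.40, 23.13, 23.38, 25.58],
    "ARC-C": [44.71, 46.67, 45.31, 46.42, 49.23],
    "MMLU-P": [9.54, 13.31, 13.15, 13.68, 17.44],
}


def column_means(rows):
    """Average each method's score over all rows."""
    n = len(rows)
    return {
        method: sum(scores[i] for scores in rows.values()) / n
        for i, method in enumerate(METHODS)
    }


def rows_where_not_best(rows, method="ReMiT"):
    """List the rows in which `method` does not have the highest score."""
    idx = METHODS.index(method)
    return [name for name, scores in rows.items() if scores[idx] < max(scores)]


if __name__ == "__main__":
    means = column_means(ROWS)
    for method, value in means.items():
        print(f"{method:12s} {value:6.2f}")
    # Absolute gain of ReMiT over each other method, in points.
    for method in METHODS[:-1]:
        print(f"ReMiT - {method}: {means['ReMiT'] - means[method]:+.2f}")
    print("Rows where ReMiT is not best:", rows_where_not_best(ROWS))
```

The recomputed column means agree with the table's "Avg." row to two decimals, and on these numbers ReMiT has the top score in every row except GPQA, where RHO-1 is higher.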
This review was generated by AI and checked by a human editor.