QUICK REVIEW

[Paper Review] ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

Junjie Huang, Jiarui Qin|arXiv (Cornell University)|Feb 3, 2026
Topic Modeling | Citations: 0
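One-line summary

ReMiT uses an RL-tuned reference model during mid-training to dynamically reweight tokens. This improves base-model performance and sustains the gains through post-training, yielding a self-reinforcing flywheel between pre-training and post-training.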

ABSTRACT

Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process--where insights from post-training retroactively improve the pre-trained foundation--remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which a reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, requiring no specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training, utilizing high-quality corpora under a rapidly decaying learning rate. Building upon this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks, spanning math, code, and general reasoning, and sustains these gains by over 2% throughout the post-training pipeline. These results validate an iterative feedback loop, enabling continuous and self-reinforcing evolution of LLMs.

Motivation and objectives

  • Identify mid-training as a critical turning point for LLM capabilities.
  • Propose a token-level dynamic reweighting mechanism guided by an RL-tuned reference model.
  • Enable bidirectional influence between post-training and pre-training without external teachers.
  • Demonstrate that mid-training improvements transfer and amplify during post-training across model families.

Proposed method

  • Introduce ReMiT, a token-level reweighting scheme that uses the RL-tuned model as a reference during mid-training.
  • Compute per-token loss discrepancy between base and RL reference, center the delta loss per sequence, and map it to weights with a clipped scaled sigmoid.
  • Integrate the weights into the mid-training objective as a soft reweighting of the standard next-token prediction loss.
  • Use the in-pipeline RL-tuned model as the reference to avoid external teachers.
  • Provide theoretical justification linking ReMiT to KL-divergence toward an implicit target distribution and to KL-regularized RL.
  • Experiment with three open-source base-model families (OLMo-1B, SmolLM3-3B, Youtu-LLM-2B), comparing ReMiT against baselines across 10 downstream benchmarks.
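
The reweighting steps listed above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sign convention (base loss minus reference loss), the `scale`, `temperature`, `w_min`, and `w_max` values, and the normalization by token count are all hypothetical choices that the review does not specify.

```python
import numpy as np


def remit_token_weights(base_loss, ref_loss, mask,
                        scale=2.0, temperature=1.0,
                        w_min=0.5, w_max=1.5):
    """Sketch of per-token weights from a base/RL-reference loss gap.

    base_loss, ref_loss: (batch, seq) per-token negative log-likelihoods.
    mask: (batch, seq), 1 for real tokens and 0 for padding.
    All hyperparameter defaults are assumptions for illustration.
    """
    base_loss = np.asarray(base_loss, dtype=float)
    ref_loss = np.asarray(ref_loss, dtype=float)
    mask = np.asarray(mask, dtype=float)

    # Loss discrepancy: positive where the base model does worse than
    # the RL reference (assumed sign convention).
    delta = (base_loss - ref_loss) * mask

    # Center the discrepancy within each sequence, ignoring padding.
    n_tokens = np.maximum(mask.sum(axis=1, keepdims=True), 1.0)
    seq_mean = delta.sum(axis=1, keepdims=True) / n_tokens
    centered = (delta - seq_mean) * mask

    # Scaled sigmoid: with scale=2 a centered value of 0 maps to weight 1.
    weights = scale / (1.0 + np.exp(-centered / temperature))

    # Clip so that no token is dropped or dominates the update.
    weights = np.clip(weights, w_min, w_max)
    return weights * mask


def weighted_ntp_loss(base_loss, weights, mask):
    """Soft-reweighted next-token prediction loss, averaged over real tokens."""
    base_loss = np.asarray(base_loss, dtype=float)
    mask = np.asarray(mask, dtype=float)
    return (weights * base_loss * mask).sum() / np.maximum(mask.sum(), 1.0)


# Toy example: one sequence of three real tokens plus one padding position.
demo_base = np.array([[2.0, 0.5, 1.0, 0.0]])
demo_ref = np.array([[0.5, 0.5, 1.0, 0.0]])
demo_mask = np.array([[1, 1, 1, 0]])
demo_weights = remit_token_weights(demo_base, demo_ref, demo_mask)
print(demo_weights)
print(weighted_ntp_loss(demo_base, demo_weights, demo_mask))
```

In a real training loop the weights would be computed without gradient flow through the reference model and applied to the base model's per-token cross-entropy. In the toy example, the first token is the only one where the base model trails the reference, so it receives a weight above 1 while the other real tokens are down-weighted.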

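Experimental results

Research questions

  • RQ1: Can mid-training reweighting guided by an RL reference improve the base model's capabilities?
  • RQ2: Do mid-training gains transfer and persist through the post-training stages (SFT, DPO, RLVR)?
  • RQ3: Does ReMiT offer advantages over knowledge distillation and token-level data filtering approaches?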

Key findings

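| Model family | Pre-Trained | Vanilla NTP | MiniPLM | RHO-1 | ReMiT | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| OLMo-1B | 3.03 | 48.14 | 48.45 | 50.42 | 61.64 | 27.56 |
| MATH | 2.94 | 10.26 | 9.60 | 10.32 | 14.50 | 9.91 |
| GPQA | 20.31 | 22.54 | 23.21 | 25.45 | 24.55 | 23.02 |
| BBH | 28.43 | 30.87 | 30.38 | 29.33 | 32.07 | 30.22 |
| IFE | 22.66 | 16.19 | 16.79 | 19.06 | 28.54 | 20.84 |
| HE | 6.71 | 8.54 | 7.32 | 6.71 | 12.80 | 8.22 |
| MBPP | 4.80 | 4.60 | 6.80 | 6.20 | 9.20 | 6.28 |
| TQA | 21.30 | 22.40 | 23.13 | 23.38 | 25.58 | 23.48 |
| ARC-C | 44.71 | 46.67 | 45.31 | 46.42 | 49.23 | 46.07 |
| MMLU-P | 9.54 | 13.31 | 13.15 | 13.68 | 17.44 | 13.62 |
| Avg. | 16.44 | 22.35 | 22.41 | 23.10 | 27.56 | 22.58 |
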
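  • ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks across model families.
  • Mid-training gains transfer to post-training and remain above 2% throughout the pipeline.
  • ReMiT outperforms baselines such as Vanilla NTP, MiniPLM, and RHO-1 on downstream tasks.
  • The method enables a joint flywheel between the base model and the RL model without an external teacher.
  • Clipping the token weights emphasizes pivotal tokens while preserving training stability and data integrity.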
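
As a sanity check on the table above, the per-method column means can be recomputed from the ten score rows. The sketch below copies the numbers as they appear in the table. The first row is keyed by the label it carries there ("OLMo-1B"), and the variable and function names are hypothetical.

```python
# Scores copied from the table above, one list per row.
# Column order: Pre-Trained, Vanilla NTP, MiniPLM, RHO-1, ReMiT.
METHODS = ["Pre-Trained", "Vanilla NTP", "MiniPLM", "RHO-1", "ReMiT"]

SCORES = {
    "OLMo-1B": [3.03, 48.14, 48.45, 50.42, 61.64],  # label as shown in the table
    "MATH":    [2.94, 10.26, 9.60, 10.32, 14.50],
    "GPQA":    [20.31, 22.54, 23.21, 25.45, 24.55],
    "BBH":     [28.43, 30.87, 30.38, 29.33, 32.07],
    "IFE":     [22.66, 16.19, 16.79, 19.06, 28.54],
    "HE":      [6.71, 8.54, 7.32, 6.71, 12.80],
    "MBPP":    [4.80, 4.60, 6.80, 6.20, 9.20],
    "TQA":     [21.30, 22.40, 23.13, 23.38, 25.58],
    "ARC-C":   [44.71, 46.67, 45.31, 46.42, 49.23],
    "MMLU-P":  [9.54, 13.31, 13.15, 13.68, 17.44],
}

# The "Avg." row as reported in the table.
REPORTED_AVG = {
    "Pre-Trained": 16.44,
    "Vanilla NTP": 22.35,
    "MiniPLM": 22.41,
    "RHO-1": 23.10,
    "ReMiT": 27.56,
}


def column_averages(scores, methods):
    """Mean score per method across all rows."""
    n_rows = len(scores)
    return {
        method: sum(row[i] for row in scores.values()) / n_rows
        for i, method in enumerate(methods)
    }


averages = column_averages(SCORES, METHODS)
for method in METHODS:
    print(f"{method:12s} computed={averages[method]:.3f} "
          f"reported={REPORTED_AVG[method]:.2f}")

# Gap between ReMiT and the Vanilla NTP baseline in this table.
gain_over_ntp = averages["ReMiT"] - averages["Vanilla NTP"]
print(f"ReMiT - Vanilla NTP: {gain_over_ntp:.2f} points")
```

The recomputed column means agree with the reported Avg. row to within rounding, and for this table ReMiT's mean is about 5.2 points above Vanilla NTP. The rightmost Avg. column is left out of the check because the review does not say how it was computed.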


This review was generated by AI and checked by a human editor.