QUICK REVIEW

[Paper Review] ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

Junjie Huang, Jiarui Qin | arXiv (Cornell University) | Feb 3, 2026

Topic Modeling | Citations: 0

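One-Line Summary

ReMiT uses an RL-tuned reference model to dynamically reweight tokens during mid-training. This improves base-model performance and sustains the gains through post-training, realizing a self-reinforcing flywheel between pre-training and post-training.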

ABSTRACT

Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process--where insights from post-training retroactively improve the pre-trained foundation--remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which a reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, requiring no specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training, utilizing high-quality corpora under a rapidly decaying learning rate. Building upon this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks, spanning math, code, and general reasoning, and sustains these gains by over 2% throughout the post-training pipeline. These results validate an iterative feedback loop, enabling continuous and self-reinforcing evolution of LLMs.

Research Motivation and Objectives

Identify mid-training as a critical turning point for LLM capabilities.
Propose a token-level dynamic reweighting mechanism guided by an RL-tuned reference model.
Enable bidirectional influence between post-training and pre-training without external teachers.
Demonstrate that mid-training improvements transfer and amplify during post-training across model families.

Proposed Method

Introduce ReMiT, a token-level reweighting scheme that uses the RL-tuned model as a reference during mid-training.
Compute per-token loss discrepancy between base and RL reference, center the delta loss per sequence, and map it to weights with a clipped scaled sigmoid.
Integrate the weights into the mid-training objective as a soft reweighting of the standard next-token prediction loss.
Use the in-pipeline RL-tuned model as the reference to avoid external teachers.
Provide theoretical justification linking ReMiT to KL-divergence toward an implicit target distribution and to KL-regularized RL.
Experiment with three open-source base-model families (OLMo-1B, SmolLM3-3B, Youtu-LLM-2B), comparing ReMiT against baselines across 10 downstream benchmarks.
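
The weighting steps listed above (per-token loss gap, per-sequence centering, clipped scaled sigmoid, soft reweighting of the next-token loss) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sign convention of the gap, the `2 * sigmoid` scaling, and the `scale`, `w_min`, and `w_max` defaults are all assumptions chosen for readability, and the function names are hypothetical.

```python
import math

def remit_token_weights(base_loss, ref_loss, scale=1.0, w_min=0.5, w_max=2.0):
    """Map per-token loss gaps of one sequence to clipped weights.

    base_loss / ref_loss: per-token next-token losses from the base model
    and the RL-tuned reference. All hyperparameter defaults are assumed.
    """
    # Assumed sign: positive where the RL reference predicts the token better.
    delta = [b - r for b, r in zip(base_loss, ref_loss)]
    mean = sum(delta) / len(delta)  # per-sequence centering
    weights = []
    for d in delta:
        # Scaled sigmoid in (0, 2); equals 1 when the centered gap is zero.
        raw = 2.0 / (1.0 + math.exp(-scale * (d - mean)))
        # Clipping bounds how far any single token can be up- or down-weighted.
        weights.append(min(max(raw, w_min), w_max))
    return weights

def reweighted_ntp_loss(base_loss, weights):
    """Soft reweighting of the standard next-token prediction loss."""
    return sum(w * l for w, l in zip(weights, base_loss)) / len(base_loss)

# Usage: the third token has the largest gap, so it gets the largest weight.
example_base = [2.0, 1.0, 3.0]
example_ref = [1.0, 1.0, 1.0]
example_weights = remit_token_weights(example_base, example_ref)
example_loss = reweighted_ntp_loss(example_base, example_weights)
```

In an actual training loop the weights would presumably be treated as constants with respect to the base model's parameters, so that gradients flow only through the next-token loss; the summary does not state this, so take it as an assumption.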

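Experimental Results

Research Questions

RQ1: Can mid-training reweighting guided by an RL reference improve the base model's capabilities?
RQ2: Do mid-training gains transfer and persist through the post-training stages (SFT, DPO, RLVR)?
RQ3: Does ReMiT offer advantages over knowledge distillation and token-level data filtering approaches?

Key Findings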

Benchmark (OLMo-1B)	Pre-Trained	Vanilla NTP	MiniPLM	RHO-1	ReMiT	Avg.
[label missing]	3.03	48.14	48.45	50.42	61.64	27.56
MATH	2.94	10.26	9.60	10.32	14.50	9.91
GPQA	20.31	22.54	23.21	25.45	24.55	23.02
BBH	28.43	30.87	30.38	29.33	32.07	30.22
IFE	22.66	16.19	16.79	19.06	28.54	20.84
HE	6.71	8.54	7.32	6.71	12.80	8.22
MBPP	4.80	4.60	6.80	6.20	9.20	6.28
TQA	21.30	22.40	23.13	23.38	25.58	23.48
ARC-C	44.71	46.67	45.31	46.42	49.23	46.07
MMLU-P	9.54	13.31	13.15	13.68	17.44	13.62
Avg.	16.44	22.35	22.41	23.10	27.56	22.58
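
As a quick sanity check on the table above, the bottom "Avg." row can be recomputed as the mean of each method's column, treating all ten data rows as benchmark rows (an assumption about the table's layout). The sketch below copies the values from the table; the variable and function names are illustrative only.

```python
# Per-method scores copied from the ten data rows of the table above.
columns = {
    "Pre-Trained": [3.03, 2.94, 20.31, 28.43, 22.66, 6.71, 4.80, 21.30, 44.71, 9.54],
    "Vanilla NTP": [48.14, 10.26, 22.54, 30.87, 16.19, 8.54, 4.60, 22.40, 46.67, 13.31],
    "MiniPLM": [48.45, 9.60, 23.21, 30.38, 16.79, 7.32, 6.80, 23.13, 45.31, 13.15],
    "RHO-1": [50.42, 10.32, 25.45, 29.33, 19.06, 6.71, 6.20, 23.38, 46.42, 13.68],
    "ReMiT": [61.64, 14.50, 24.55, 32.07, 28.54, 12.80, 9.20, 25.58, 49.23, 17.44],
}

# Values from the table's bottom "Avg." row.
reported_avg = {
    "Pre-Trained": 16.44,
    "Vanilla NTP": 22.35,
    "MiniPLM": 22.41,
    "RHO-1": 23.10,
    "ReMiT": 27.56,
}

def column_means(cols):
    """Return the arithmetic mean of each method's scores."""
    return {name: sum(values) / len(values) for name, values in cols.items()}

means = column_means(columns)

# Gap between ReMiT and plain next-token-prediction mid-training, in points.
gain_over_vanilla = means["ReMiT"] - means["Vanilla NTP"]

for name, value in means.items():
    print(f"{name}: computed {value:.3f}, reported {reported_avg[name]:.2f}")
```

Under that reading, each recomputed column mean agrees with the reported bottom row to within rounding.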

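ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks across model families.
Mid-training gains transfer to post-training, sustaining an improvement of over 2% throughout the pipeline.
ReMiT outperforms baselines such as Vanilla NTP, MiniPLM, and RHO-1 on downstream tasks.
The method enables a flywheel in which the base model and the RL model improve each other, with no external teacher.
Clipping the token weights emphasizes pivotal tokens while preserving training stability and data integrity.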

This review was generated by AI and checked by a human editor.