QUICK REVIEW

[Paper Review] MC-GRPO: Median-Centered Group Relative Policy Optimization for Small-Rollout Reinforcement Learning

Youngeun Kim|arXiv (Cornell University)|Jan 30, 2026

Reinforcement Learning in Robotics|Citations: 0

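One-line summary

MC-GRPO replaces the mean-centered baseline with a median-centered baseline, improving training stability and accuracy when the rollout budget is small. It samples one additional rollout per prompt to form the median reference, then excludes the median rollout from the gradient update, which reduces advantage sign flips while keeping the update size fixed at G.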

ABSTRACT

Group-relative policy optimization methods train language models by generating multiple rollouts per prompt and normalizing rewards with a shared mean reward baseline. In resource-constrained settings where the rollout budget is small, accuracy often degrades. We find that noise in the shared baseline induces advantage sign flips, where some rollouts receive an incorrect advantage sign, and the update direction is reversed. To address this, we propose Median-Centered Group Relative Policy Optimization (MC-GRPO), a simple and effective solution for small-rollout training. Our main idea is to replace the mean baseline with a median baseline: the median is far less sensitive to outlier rewards than the mean, mitigating the sign flips under small rollout size (G). We generate one additional rollout for median reference (G+1), and compute advantages by using the group median. With an odd-sized group, exactly one completion is the median and receives zero advantage, we exclude this pivot rollout from backpropagation so the number of gradient-contributing samples per prompt remains G, preserving the core update cost of standard G-rollout training. Across various GRPO-family methods and a wide range of models and scales, this median-centered training consistently improves stability and final accuracy in the low-rollout regime, reducing the gap between G=2 and G=8 to within 1%. Code is available at https://github.com/lotusroot-kim/MC-GRPO

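Motivation and objectives

Identify why GRPO-style baselines become unreliable under small rollout budgets.
Propose Median-Centered Group Relative Policy Optimization (MC-GRPO) to mitigate baseline noise.
Show that MC-GRPO improves stability and final accuracy in the low-rollout regime across GRPO-family variants and models.
Show that MC-GRPO narrows the performance gap between 2-rollout and 8-rollout training.
Evaluate robustness to outlier rewards and generalization to out-of-distribution math benchmarks.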

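Proposed method

Sample G+1 rollouts per prompt, forming an odd-sized group.
Compute the group baseline b(q) as the median reward over the G+1 rollouts.
Compute each advantage as (r_i - b(q)) / (MAD(r) + epsilon), where MAD(r) is the median absolute deviation of the group's rewards and epsilon is a small constant.
Exclude the median completion, whose advantage is zero, from backpropagation so that G samples per prompt contribute gradients.
Substitute the median-centered advantage for the standard GRPO group-normalized advantage and apply it in the existing GRPO objective.
Because the update size and processing pipeline are unchanged, the method serves as a drop-in replacement for GRPO-family methods.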
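
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration based only on the description in this review, not the authors' implementation: the function name `mc_grpo_advantages`, the default `eps`, and the rule of dropping the first rollout whose reward equals the median when several tie are all assumptions made for the example.

```python
import numpy as np


def mc_grpo_advantages(rewards, eps=1e-6):
    """Median-centered advantages for one prompt's G+1 rollouts.

    Hypothetical helper: returns (advantages, keep_mask), where keep_mask
    is False for the single pivot rollout left out of backpropagation.
    """
    r = np.asarray(rewards, dtype=np.float64)
    if r.size % 2 == 0:
        raise ValueError("expected an odd-sized group (G + 1 rollouts)")

    baseline = np.median(r)                  # b(q): group median reward
    mad = np.median(np.abs(r - baseline))    # MAD(r): median absolute deviation
    advantages = (r - baseline) / (mad + eps)

    # With an odd group size the median is one of the rewards, so at least
    # one rollout has zero advantage. Assumption: on ties, drop the first.
    pivot = int(np.argmin(np.abs(r - baseline)))
    keep_mask = np.ones(r.size, dtype=bool)
    keep_mask[pivot] = False
    return advantages, keep_mask


# Usage with G = 4, so G + 1 = 5 rollouts (made-up rewards).
adv, keep = mc_grpo_advantages([0.0, 1.0, 0.5, 0.2, 0.9])
grad_advantages = adv[keep]  # the G advantages that would enter the loss
```

One caveat of this sketch: when more than half of the rewards in a group are identical (common with binary rewards), MAD(r) is zero and `eps` alone sets the scale. This review does not say how the paper handles that case.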

Figure 1: Accuracy (%) versus the number of rollouts for Qwen3-1.7B trained on GSM8K. We compare the original GRPO, DAPO, and DR-GRPO methods (baselines) with their Median-Centered (MC) variants (ours). MC training improves robustness and yields larger gains under small rollout budgets (2 $\s

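Experimental results

Research questions

RQ1: Does median centering reduce advantage sign flips in small-rollout GRPO-style training?
RQ2: Does MC-GRPO consistently improve stability and final accuracy at small G across GRPO-family methods and model scales?
RQ3: Do the gains come from the robustness of the median baseline rather than simply from the extra rollout?
RQ4: Does MC-GRPO improve out-of-distribution generalization when trained with few rollouts?
RQ5: How does MC-GRPO perform under composite (fine-grained) rewards?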

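Key findings

The median-centered baseline substantially lowers the sign-flip rate under small rollout budgets (G in {2, 4}).
MC-GRPO improves accuracy over GRPO across multiple models and datasets; in the reported settings the gains reach up to +4.62% at G=2 and range from +2.35% to +2.67% at G=4.
Across the GRPO family (GRPO, DAPO, DR-GRPO), stability and final accuracy improve in the low-budget regime, and the method remains competitive at the higher budget (G=8).
In the reported cases, MC-GRPO narrows the performance gap between 2 and 8 rollouts to within 1%.
Zero-shot accuracy on the out-of-distribution benchmarks AIME-24 and AMC-23 is higher for MC-GRPO than for GRPO under small rollout budgets.
MC-GRPO remains effective under a composite discrete reward (r_acc + r_fmt), and its gains hold up against a control that accounts for the additional sampling, supporting the robustness of the median-baseline mechanism.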
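
To make the sign-flip finding concrete, the toy example below compares advantage signs under a mean baseline and a median baseline for a single group containing one outlier reward. The reward values and the helper name `advantage_signs` are hypothetical, and this is not the paper's measurement protocol; it only illustrates why a median baseline is less sensitive to one extreme rollout.

```python
import numpy as np


def advantage_signs(rewards, baseline_fn):
    """Sign of (r_i - baseline) for every rollout in one group."""
    r = np.asarray(rewards, dtype=np.float64)
    return np.sign(r - baseline_fn(r)).astype(int)


# Hypothetical group: four ordinary rollouts plus one outlier reward.
rewards = [0.2, 0.4, 0.5, 0.6, 9.0]

mean_signs = advantage_signs(rewards, np.mean)      # baseline = 2.14
median_signs = advantage_signs(rewards, np.median)  # baseline = 0.5

print(mean_signs.tolist())    # [-1, -1, -1, -1, 1]
print(median_signs.tolist())  # [-1, -1, 0, 1, 1]
```

Under the mean baseline the outlier pulls the baseline to 2.14, so the 0.6-reward rollout is pushed down even though it beats most of its group. Under the median baseline it keeps a positive sign, and the 0.5-reward rollout becomes the zero-advantage pivot.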

Figure 2: Sign flips are frequent under small rollout budgets. (a) With few rollouts, the sample-mean baseline can shift substantially depending on which rollouts are included, causing an advantage sign flip for the same trajectory (e.g., the $0.5$-reward sample flips sign when the rollout set c

This review was generated by AI and checked by a human editor.