QUICK REVIEW

[Paper Review] Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning

Kishan Panaganti, Zhenwen Liang|arXiv (Cornell University)|Jan 27, 2026

Artificial Intelligence in Healthcare and Education | Citations: 0

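One-Line Summary

Summary: This work introduces a Multi-Adversary GDRO framework that improves LLM reasoning by dynamically partitioning prompts according to online difficulty and reallocating rollouts across the resulting groups, achieving average relative pass@8 gains of roughly 10% over GRPO.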

ABSTRACT

Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers and additionally a variance-proxy analysis motivating a square-root optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model's performance.

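Motivation and Objectives

Motivate non-uniform training by the heavy-tailed difficulty distribution of reasoning tasks.
Propose a data-agnostic Online Difficulty Classifier that partitions prompts into dynamic groups.
Develop two independent GDRO-based adversaries (Prompt-GDRO and Rollout-GDRO) that optimize prompt sampling and compute allocation, respectively.
Provide theoretical connections to entropy-regularized GDRO and a variance-proxy analysis.
Demonstrate empirical improvements on DAPO 14.1k across multiple model scales.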
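
The online difficulty grouping described above can be illustrated with a minimal sketch. It is not the paper's classifier: the class name `OnlineDifficultyBinner`, the EMA decay, the equal-width bins, and the use of a running pass rate in place of pass@k are all assumptions made for illustration.

```python
class OnlineDifficultyBinner:
    """Hypothetical sketch: track a running pass rate per prompt and bin it.

    Assumption: difficulty is summarised by an EMA of the fraction of correct
    rollouts, split into equal-width bins. The paper's pass@k-based rule may differ.
    """

    def __init__(self, num_bins=4, decay=0.9):
        self.num_bins = num_bins
        self.decay = decay
        self.pass_rate = {}  # prompt_id -> EMA of the empirical pass rate

    def update(self, prompt_id, rewards):
        """Fold this step's 0/1 rollout rewards into the prompt's running pass rate."""
        batch_rate = sum(rewards) / len(rewards)
        old = self.pass_rate.get(prompt_id)
        if old is None:
            self.pass_rate[prompt_id] = batch_rate
        else:
            self.pass_rate[prompt_id] = self.decay * old + (1 - self.decay) * batch_rate
        return self.bin_of(prompt_id)

    def bin_of(self, prompt_id):
        """Return the bin index; 0 is the hardest bin. Unseen prompts start there."""
        rate = self.pass_rate.get(prompt_id, 0.0)
        return min(int(rate * self.num_bins), self.num_bins - 1)


binner = OnlineDifficultyBinner(num_bins=4, decay=0.5)
first_bin = binner.update("p1", [0, 0, 0, 1])   # pass rate 0.25
second_bin = binner.update("p1", [1, 1, 1, 1])  # EMA moves to 0.625
```

Because the bin is recomputed after every update, a prompt migrates to easier bins as the policy improves, which is what makes the groups dynamic rather than fixed labels.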

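Proposed Method

Define an Online Difficulty Classifier that partitions prompts into dynamic pass@k-based bins.
Implement Prompt-GDRO, which uses an EMA-debiased EXP3P sampler to reweight GRPO updates by bin difficulty.
Implement Rollout-GDRO as a compute adversary that allocates rollouts across bins under a mean-budget constraint.
Use EMA scores that track the aggregated (mean) loss of each bin, which avoids frequency bias.
Formulate Rollout-GDRO as a constrained optimization with a shadow price mu that maximizes gradient variance reduction.
Provide an entropic-GDRO interpretation that yields a soft worst-group objective and no-regret guarantees.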
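
The prompt-side adversary can be sketched as a multiplicative-weights sampler over difficulty bins. This is a simplified illustration, not the paper's EXP3P variant: the class `PromptGroupSampler`, the step size `eta`, the uniform exploration mix, and the exact update rule are assumptions. The one point it tries to show is the debiasing idea from the list above: scores are EMAs of per-bin mean loss, so a bin is not upweighted merely for appearing often in a batch.

```python
import math
import random


class PromptGroupSampler:
    """Hypothetical multiplicative-weights sampler over difficulty bins."""

    def __init__(self, num_groups, eta=0.5, ema_decay=0.9, explore=0.1, seed=0):
        self.num_groups = num_groups
        self.eta = eta                # step size of the multiplicative update
        self.ema_decay = ema_decay
        self.explore = explore        # uniform mixing keeps every bin reachable
        self.log_w = [0.0] * num_groups
        self.ema_loss = [None] * num_groups
        self.rng = random.Random(seed)

    def probs(self):
        """Softmax of the log-weights, mixed with a uniform distribution."""
        m = max(self.log_w)
        exp_w = [math.exp(lw - m) for lw in self.log_w]
        z = sum(exp_w)
        return [(1 - self.explore) * e / z + self.explore / self.num_groups
                for e in exp_w]

    def sample_group(self):
        return self.rng.choices(range(self.num_groups), weights=self.probs(), k=1)[0]

    def update(self, group_ids, losses):
        """Update EMA scores from per-bin MEAN loss, then apply the weight update."""
        sums = [0.0] * self.num_groups
        counts = [0] * self.num_groups
        for g, loss in zip(group_ids, losses):
            sums[g] += loss
            counts[g] += 1
        for g in range(self.num_groups):
            if counts[g] == 0:
                continue  # bins absent from this batch keep their previous score
            mean_loss = sums[g] / counts[g]
            prev = self.ema_loss[g]
            self.ema_loss[g] = (mean_loss if prev is None
                                else self.ema_decay * prev + (1 - self.ema_decay) * mean_loss)
        for g in range(self.num_groups):
            if self.ema_loss[g] is not None:
                self.log_w[g] += self.eta * self.ema_loss[g]


sampler = PromptGroupSampler(num_groups=3)
# Bin 0 appears four times with low loss; bin 2 appears once with high loss.
sampler.update([0, 0, 0, 0, 2], [0.3, 0.3, 0.3, 0.3, 0.9])
p = sampler.probs()
```

In this toy batch the summed loss of bin 0 (1.2) exceeds that of bin 2 (0.9), so a sum-based score would favour the frequent bin. The mean-based score ranks bin 2 as harder instead.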

Figure 1: Beyond Uniform Reasoning—A Multi-Adversary Post-Training Framework. Plots on the right represent training steps tail averages ( $\geq$ 60th percentile) capturing the curriculum. (Left) Our framework significantly outperforms the standard GRPO baseline across mathematical reasoning benchmar

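Experimental Results

Research Questions

RQ1: Does dynamic, data-agnostic difficulty grouping improve the learning signal for LLM reasoning compared with static uniform sampling?
RQ2: Do two independent GDRO adversaries, one for prompt sampling and one for rollout-budget allocation, deliver additional gains over GRPO when post-training on reasoning tasks?
RQ3: How do EMA debiasing and variance-aware allocation affect worst-group robustness and gradient variance?
RQ4: What theoretical guarantees and interpretations (entropy-regularized GDRO and the variance proxy) support the proposed adversarial framework?
RQ5: Does the proposed method produce empirical improvements on the DAPO reasoning dataset across different model scales?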

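Key Findings

Prompt-GDRO improves pass@8 over GRPO by roughly 9.74% to 13.13% on the 1.7B, 4B, and 8B Qwen3-Base models.
Rollout-GDRO improves pass@8 over GRPO by roughly 9.20% to 10.64% at the same model scales.
The framework produces an emergent curriculum that shifts resources toward the evolving reasoning frontier.
EMA-debiased scoring avoids frequency bias and keeps a diverse set of difficulty groups active.
The theoretical grounding links Prompt-GDRO to a surrogate solution of entropy-regularized GDRO and gives it a no-regret interpretation.
A square-root law motivates the variance-optimal rollout allocation under a compute-neutral budget.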
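
The square-root law in the last finding can be made concrete with a small worked example. The variance proxy here is an assumption chosen for illustration, not necessarily the paper's: minimise sum_g p_g * sigma_g^2 / n_g subject to the mean budget sum_g p_g * n_g = B, where p_g is the frequency of group g, sigma_g a per-rollout standard-deviation proxy, and n_g the rollouts per prompt. Setting the Lagrangian derivative to zero gives n_g = sigma_g / sqrt(mu), and the budget constraint fixes the shadow price through sqrt(mu) = sum_g p_g * sigma_g / B. The function name `sqrt_rollout_allocation` and the input values are hypothetical.

```python
def sqrt_rollout_allocation(group_freq, group_std, mean_budget):
    """Closed-form allocation for the assumed proxy sum_g p_g * sigma_g^2 / n_g.

    Returns real-valued rollouts per prompt for each group; rounding to
    integers and clipping to a minimum are deliberately left out of this sketch.
    """
    weighted_std = sum(p * s for p, s in zip(group_freq, group_std))
    if weighted_std == 0:
        # No variance anywhere: fall back to the uniform allocation.
        return [float(mean_budget)] * len(group_freq)
    sqrt_mu = weighted_std / mean_budget  # square root of the shadow price mu
    return [s / sqrt_mu for s in group_std]


freq = [0.5, 0.3, 0.2]  # hypothetical share of prompts in each group
std = [0.3, 0.5, 0.3]   # hypothetical per-rollout reward std in each group
alloc = sqrt_rollout_allocation(freq, std, mean_budget=16)
spent = sum(p * n for p, n in zip(freq, alloc))
```

The allocation is proportional to sigma_g, the square root of the per-rollout variance, and the frequency-weighted average stays at the budget of 16, so compute moves between groups without increasing total cost.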

Figure 2: Conceptual Illustration: Static Uniformity vs. Multi-Adversary GDRO (Dynamic). (Left) Standard GRPO samples prompts uniformly ( $q=1/B$ ) and assigns a fixed number of rollouts (schematically $N=16$ ), causing it to overfit easy tasks while under-exploring the frontier. (Right) Our framewo


This review was generated by AI and checked by a human editor.