QUICK REVIEW

[Paper Review] Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning

Kishan Panaganti, Zhenwen Liang|arXiv (Cornell University)|Jan 27, 2026
Artificial Intelligence in Healthcare and Education | Cited by 0
One-Sentence Summary

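The paper proposes a multi-adversary GDRO framework that dynamically partitions prompts by online difficulty and reallocates sampling probability and rollouts across the resulting groups to improve LLM reasoning, reporting average relative pass@8 gains of about 10% over GRPO.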

ABSTRACT

Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers and additionally a variance-proxy analysis motivating a square-root optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model's performance.
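
The abstract reports results in pass@8 and groups prompts by pass@k. As a reference point, here is a minimal sketch of the widely used unbiased pass@k estimator, computed from `n` sampled rollouts of which `c` are correct. Whether the paper evaluates pass@8 with exactly this estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled rollouts, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n rollouts contains no correct rollout.
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, `pass_at_k(16, 2, 8)` estimates the chance that at least one of 8 rollouts succeeds for a prompt that was solved in 2 of 16 samples.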

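Motivation and Objectives

  • Motivate non-uniform training on reasoning tasks, since problem difficulty follows a heavy-tailed distribution.
  • Propose a data-agnostic Online Difficulty Classifier that partitions prompts into dynamic groups.
  • Develop two independent GDRO-based adversaries (Prompt-GDRO and Rollout-GDRO) that optimize prompt sampling and compute allocation, respectively.
  • Connect the method theoretically to entropy-regularized GDRO and to a variance-proxy analysis.
  • Demonstrate empirical improvements on the DAPO 14.1k dataset across multiple model scales.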
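
To make the idea of an online difficulty classifier concrete, the sketch below keeps a running pass rate per prompt and maps it to one of a few equal-width bins. It is an illustration under stated assumptions, not the paper's implementation: the class name, the EMA update, the equal-width binning, and the defaults `num_bins=4` and `decay=0.9` are all hypothetical.

```python
class OnlineDifficultyClassifier:
    """Tracks a running pass rate per prompt and maps it to a difficulty bin.

    Hypothetical sketch: the binning rule and EMA update are assumptions.
    """

    def __init__(self, num_bins: int = 4, decay: float = 0.9):
        self.num_bins = num_bins
        self.decay = decay
        self.pass_rate = {}  # prompt_id -> EMA of the observed pass rate

    def update(self, prompt_id: str, num_correct: int, num_rollouts: int) -> None:
        observed = num_correct / num_rollouts
        old = self.pass_rate.get(prompt_id)
        if old is None:
            # The first observation initializes the running estimate.
            self.pass_rate[prompt_id] = observed
        else:
            self.pass_rate[prompt_id] = (
                self.decay * old + (1.0 - self.decay) * observed
            )

    def group(self, prompt_id: str) -> int:
        """Bin 0 is the hardest (lowest pass rate), num_bins - 1 the easiest."""
        rate = self.pass_rate.get(prompt_id, 0.0)  # unseen prompts count as hard
        return min(int(rate * self.num_bins), self.num_bins - 1)
```

Because the pass rate is re-estimated after every batch of rollouts, a prompt migrates between bins as the policy improves, which is what makes the groups dynamic rather than fixed by dataset labels.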

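Proposed Method

  • Define a classifier that assigns prompts to dynamic pass-rate bins based on their online difficulty.
  • Implement Prompt-GDRO as an EMA-debiased EXP3P sampler that reweights GRPO updates by group difficulty.
  • Implement Rollout-GDRO as a compute adversary that allocates rollouts across groups under a mean-budget constraint.
  • Track the intensive (mean) loss of each group with EMA scores to avoid frequency bias.
  • Formulate Rollout-GDRO as a constrained optimization with shadow price $\mu$ that maximizes gradient-variance reduction.
  • Give an entropic GDRO interpretation showing that the method optimizes a soft worst-group objective with no-regret guarantees.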
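
The sampling adversary can be illustrated with a simplified multiplicative-weights sampler over difficulty groups. The sketch scores each group by an EMA of its per-step mean loss (an intensive quantity, so a group is not upweighted merely for appearing often) and mixes the exponential weights with a uniform floor for exploration. It is not the paper's EXP3P variant: the class name, the parameters `eta`, `decay`, and `gamma`, and the choice of loss (for instance, one minus the pass rate) are assumptions.

```python
import math
import random


class PromptGroupSampler:
    """Simplified multiplicative-weights sampler over difficulty groups."""

    def __init__(self, num_groups, eta=1.0, decay=0.9, gamma=0.1, seed=0):
        self.eta, self.decay, self.gamma = eta, decay, gamma
        # Unseen groups keep score 0 and rely on the uniform floor below.
        self.ema_loss = [0.0] * num_groups
        self.seen = [False] * num_groups
        self.rng = random.Random(seed)

    def update(self, group, losses):
        """Fold the mean loss of this step's prompts from `group` into its EMA."""
        if not losses:
            return
        mean_loss = sum(losses) / len(losses)  # mean, not sum: no frequency bias
        if self.seen[group]:
            self.ema_loss[group] = (
                self.decay * self.ema_loss[group] + (1 - self.decay) * mean_loss
            )
        else:
            self.ema_loss[group] = mean_loss
            self.seen[group] = True

    def probabilities(self):
        # q_g is proportional to exp(eta * score_g); subtracting the max
        # keeps the exponentials numerically stable.
        top = max(self.ema_loss)
        weights = [math.exp(self.eta * (s - top)) for s in self.ema_loss]
        total = sum(weights)
        k = len(weights)
        # Mix with the uniform distribution so every group keeps mass gamma / k.
        return [(1 - self.gamma) * w / total + self.gamma / k for w in weights]

    def sample_group(self):
        probs = self.probabilities()
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Exponential weights of this form are the maximizer of the expected group loss minus a KL penalty toward the uniform distribution, which is one way to read the "soft worst-group objective" in the last bullet: a larger `eta` concentrates sampling on the hardest group, while a smaller `eta` stays close to uniform.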
Figure 1: Beyond Uniform Reasoning—A Multi-Adversary Post-Training Framework. Plots on the right represent training steps tail averages ( $\geq$ 60th percentile) capturing the curriculum. (Left) Our framework significantly outperforms the standard GRPO baseline across mathematical reasoning benchmar

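Experimental Results

Research Questions

  • RQ1: Compared with static uniform sampling, does dynamic, data-agnostic difficulty grouping improve the learning signal in LLM reasoning?
  • RQ2: Do the two independent GDRO adversaries (sampling and rollout budget) yield additive gains over GRPO when post-training on reasoning tasks?
  • RQ3: How do EMA debiasing and variance-aware allocation affect worst-group robustness and gradient variance?
  • RQ4: Which theoretical guarantees or interpretations (entropy-regularized GDRO and the variance proxy) support the proposed adversarial framework?
  • RQ5: Does the proposed method bring measurable improvements across model scales on the DAPO reasoning dataset?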

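Key Findings

  • Prompt-GDRO improves pass@8 over GRPO by roughly 9.74% to 13.13% (relative) on the 1.7B, 4B, and 8B Qwen3-Base models.
  • Rollout-GDRO improves pass@8 over GRPO by roughly 9.20% to 10.64% (relative) at the same model scales.
  • The framework produces an emergent curriculum in which resources shift toward the evolving reasoning frontier.
  • EMA-debiased scoring avoids frequency bias and keeps a diverse set of difficulty groups active.
  • The theory maps Prompt-GDRO onto entropy-regularized GDRO and gives it a no-regret interpretation.
  • A square-root law motivates the variance-optimal rollout allocation under a compute-neutral budget.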
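
The square-root law can be reproduced with a short Lagrangian argument. Suppose group $g$ holds a fraction $p_g$ of the prompts, has a per-rollout variance proxy $v_g$, and receives $n_g$ rollouts per prompt. Minimizing $\sum_g p_g v_g / n_g$ subject to the mean budget $\sum_g p_g n_g = \bar{N}$ gives the stationarity condition $-p_g v_g / n_g^2 + \mu p_g = 0$, so $n_g = \sqrt{v_g / \mu}$: rollouts scale with the square root of the variance proxy, and the shadow price $\mu$ is set by the budget. The sketch below implements this closed form. The specific variance proxy, the function name, and the continuous (unrounded) allocation are assumptions, not the paper's exact controller.

```python
import math


def sqrt_rollout_allocation(group_fracs, variances, mean_budget):
    """Continuous rollouts per prompt for each group.

    Minimizes sum_g p_g * v_g / n_g subject to sum_g p_g * n_g = mean_budget.
    """
    if abs(sum(group_fracs) - 1.0) > 1e-9:
        raise ValueError("group_fracs must sum to 1")
    sigmas = [math.sqrt(v) for v in variances]
    norm = sum(p * s for p, s in zip(group_fracs, sigmas))
    if norm == 0.0:
        # No variance signal anywhere: fall back to the uniform allocation.
        return [float(mean_budget)] * len(variances)
    # Stationarity gives n_g = sigma_g / sqrt(mu); the budget fixes 1 / sqrt(mu).
    scale = mean_budget / norm
    return [scale * s for s in sigmas]
```

With two equally sized groups, variance proxies 1 and 4, and a mean budget of 6 rollouts, the allocation is 4 and 8 rollouts per prompt, which still averages to 6. A practical controller would also round to integers and clip to minimum and maximum rollout counts, which this sketch omits.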
Figure 2: Conceptual Illustration: Static Uniformity vs. Multi-Adversary GDRO (Dynamic). (Left) Standard GRPO samples prompts uniformly ( $q=1/B$ ) and assigns a fixed number of rollouts (schematically $N=16$ ), causing it to overfit easy tasks while under-exploring the frontier. (Right) Our framewo

This review was generated by AI and reviewed by human editors.