QUICK REVIEW

[Paper Review] Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning

Kishan Panaganti, Zhenwen Liang|arXiv (Cornell University)|Jan 27, 2026

Artificial Intelligence in Healthcare and Education | Cited by 0

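One-Sentence Summary

The paper proposes a multi-adversary GDRO framework that dynamically partitions prompts by online difficulty and reallocates rollouts across the resulting groups to improve LLM reasoning, reporting average relative pass@8 gains of about 10% over GRPO.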

ABSTRACT

Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers and additionally a variance-proxy analysis motivating a square-root optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model's performance.

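Motivation and Objectives

Motivate non-uniform training on reasoning tasks, since prompt difficulty follows a heavy-tailed distribution.
Propose a data-agnostic Online Difficulty Classifier that partitions prompts into dynamic groups.
Develop two independent GDRO-based adversaries (Prompt-GDRO and Rollout-GDRO) that optimize prompt sampling and compute allocation, respectively.
Connect the method theoretically to entropy-regularized GDRO and to a variance-proxy analysis.
Demonstrate empirical improvements on the DAPO 14.1k dataset across multiple model scales.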

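Proposed Method

Define a classifier that assigns prompts to dynamic pass-rate bins based on online difficulty.
Implement Prompt-GDRO on top of an EMA-debiased EXP3P sampler that reweights GRPO updates by group difficulty.
Implement Rollout-GDRO as a compute adversary that allocates rollouts across groups under a mean-budget constraint.
Track the intensive (mean) loss with EMA scores to avoid frequency bias.
Formulate Rollout-GDRO as a constrained optimization with shadow price $\mu$, maximizing the reduction in gradient variance.
Provide an entropic-GDRO interpretation: a soft worst-group objective with no-regret guarantees.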
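The binning and sampling steps listed above can be illustrated with a minimal sketch. It is not the paper's implementation: it uses a plain multiplicative-weights (Hedge-style) update with uniform exploration mixing rather than EXP3P, and it interprets "EMA debiasing" as a per-group mean loss combined with a zero-initialization correction. The names `difficulty_bin` and `GroupSampler`, and the parameters `eta`, `decay`, and `explore`, are hypothetical.

```python
import math
import random


def difficulty_bin(pass_rate, num_bins=4):
    """Map an empirical pass rate in [0, 1] to a bin index (0 = hardest)."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    return min(int(pass_rate * num_bins), num_bins - 1)


class GroupSampler:
    """Hedge-style sampler over difficulty groups (illustrative assumption)."""

    def __init__(self, num_groups, eta=0.5, decay=0.9, explore=0.1):
        self.eta, self.decay, self.explore = eta, decay, explore
        self.log_w = [0.0] * num_groups      # log-weights of the adversary
        self.ema_loss = [0.0] * num_groups   # EMA of per-group mean loss
        self.ema_norm = [0.0] * num_groups   # EMA mass, used for debiasing

    def probs(self):
        """Softmax of the log-weights, mixed with uniform exploration."""
        m = max(self.log_w)
        w = [math.exp(x - m) for x in self.log_w]
        z, k = sum(w), len(w)
        return [(1 - self.explore) * wi / z + self.explore / k for wi in w]

    def sample(self):
        """Draw one group index according to the current distribution."""
        return random.choices(range(len(self.log_w)), weights=self.probs())[0]

    def update(self, group_losses):
        """group_losses maps group index -> list of losses seen this step."""
        for g, losses in group_losses.items():
            if not losses:
                continue
            # The mean (not the sum) keeps the score independent of how
            # often the group was sampled.
            mean_loss = sum(losses) / len(losses)
            self.ema_loss[g] = self.decay * self.ema_loss[g] + (1 - self.decay) * mean_loss
            self.ema_norm[g] = self.decay * self.ema_norm[g] + (1 - self.decay)
        for g in range(len(self.log_w)):
            if self.ema_norm[g] > 0:
                score = self.ema_loss[g] / self.ema_norm[g]  # debiased EMA
                self.log_w[g] += self.eta * score            # upweight hard groups
```

In this sketch, groups with persistently high mean loss accumulate weight, while a group that has never been observed keeps its initial weight and is still reached through the exploration term.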

Figure 1: Beyond Uniform Reasoning—A Multi-Adversary Post-Training Framework. Plots on the right represent training steps tail averages ( $\geq$ 60th percentile) capturing the curriculum. (Left) Our framework significantly outperforms the standard GRPO baseline across mathematical reasoning benchmar

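Experimental Results

Research Questions

RQ1: Compared with static uniform sampling, does dynamic, data-agnostic difficulty grouping improve the learning signal for LLM reasoning?
RQ2: Do the two independent GDRO adversaries (sampling and rollout budget) deliver additive gains over GRPO in post-training for reasoning tasks?
RQ3: How do EMA debiasing and variance-aware allocation affect worst-group robustness and gradient variance?
RQ4: Which theoretical guarantees or interpretations (entropy-regularized GDRO and the variance proxy) support the proposed adversarial framework?
RQ5: Does the proposed method bring measurable improvements across model scales on the DAPO reasoning dataset?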

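Key Findings

Prompt-GDRO improves pass@8 over GRPO by roughly 9.74% to 13.13% on the 1.7B, 4B, and 8B Qwen3-Base models (+10.6% on average, per the abstract).
Rollout-GDRO improves pass@8 over GRPO by roughly 9.20% to 10.64% at the same model scales (+10.1% on average, per the abstract).
The framework yields an emergent curriculum: resources shift toward the evolving reasoning frontier.
EMA-debiased scoring avoids frequency bias and keeps a diverse set of difficulty groups active.
The theory links Prompt-GDRO to entropy-regularized GDRO and gives it a no-regret interpretation.
A square-root law motivates variance-optimal rollout allocation under a compute-neutral budget.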
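The square-root law in the last finding can be made concrete under one assumed variance proxy, which may differ from the paper's exact objective. Suppose group g is sampled with probability q_g, has per-rollout variance sigma_g^2, and receives n_g rollouts. Minimizing sum_g q_g * sigma_g^2 / n_g subject to the mean budget sum_g q_g * n_g = N gives the stationarity condition sigma_g^2 / n_g^2 = mu, so n_g is proportional to sigma_g, the square root of the variance. The function name and arguments below are hypothetical, and rounding to integer rollout counts is left out.

```python
import math


def sqrt_rollout_allocation(group_probs, group_variances, mean_budget):
    """Real-valued rollouts per group, n_g proportional to sqrt(variance_g),

    scaled so that sum_g q_g * n_g equals mean_budget (compute-neutral).
    """
    roots = [math.sqrt(max(v, 0.0)) for v in group_variances]
    denom = sum(q * r for q, r in zip(group_probs, roots))
    if denom == 0.0:
        # No variance signal anywhere: fall back to the uniform allocation.
        return [float(mean_budget)] * len(roots)
    return [mean_budget * r / denom for r in roots]


# Two equally likely groups; the second has 9x the variance of the first.
alloc = sqrt_rollout_allocation([0.5, 0.5], [1.0, 9.0], mean_budget=8)
```

Under this proxy, a group with nine times the variance receives three times the rollouts, while the expected rollouts per prompt stays at the budget.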

Figure 2: Conceptual Illustration: Static Uniformity vs. Multi-Adversary GDRO (Dynamic). (Left) Standard GRPO samples prompts uniformly ( $q=1/B$ ) and assigns a fixed number of rollouts (schematically $N=16$ ), causing it to overfit easy tasks while under-exploring the frontier. (Right) Our framewo


This review was generated by AI and checked by a human editor.