QUICK REVIEW

[Paper Review] Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

Shyam Sundhar Ramesh, Xiaotong Ji|arXiv (Cornell University)|Feb 5, 2026

Multimodal Machine Learning Applications | Cited by 0

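One-Sentence Summary

MT-GRPO introduces improvement-aware task reweighting and a ratio-preserving sampler to achieve robust, balanced multi-task reasoning in GRPO-based RL post-training, raising worst-task accuracy while keeping average performance competitive.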

ABSTRACT

RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.

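Motivation and Goals

Promote balanced capability across diverse reasoning tasks during RL post-training.
Directly optimize worst-task performance while preserving average task performance.
Address the zero-gradient prompts and task interference that arise in naive multi-task GRPO.
Introduce mechanisms that align the learned task weights with each task's actual gradient contribution.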
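
To make the zero-gradient issue concrete: GRPO normalizes each response's reward against the other responses sampled for the same prompt, so a prompt whose responses all receive the same reward produces all-zero advantages and contributes no gradient. The snippet below is a minimal sketch of that calculation. The function name `group_advantages`, the `eps` stabilizer, and the reward values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A prompt with mixed outcomes yields a usable learning signal.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# Prompts that are always solved or never solved yield all-zero advantages,
# so they add nothing to the policy gradient.
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])
```

A task that is mostly too easy or mostly too hard for the current policy therefore loses a large share of its prompts to this effect, which is why its effective weight in the batch can drift away from its nominal weight.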

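Proposed Method

Two key ideas: (i) improvement-aware reweighting that favors weaker or slowly improving tasks; (ii) a ratio-preserving batch construction mechanism so that the gradient reflects the adapted task weights.
The formal objective is framed as a constrained minimax problem that balances average performance against cross-task robustness (Eq. (4)), together with its Lagrangian relaxation (Eq. (5)).
The update rule couples policy optimization with adaptive task weights through alternating steps: a θ update that uses the z-weighted GRPO gradient (Eq. (6)), and a ξ update that adjusts z based on the improvement signal (Eq. (7)).
The improvement-aware weight update (IWU) stabilizes reweighting by using a combined signal of the task improvement I_k^(t) and the task reward (Subroutine 1).
The ratio-preserving (RP) sampler enforces target task proportions in the post-filtering batch so that they match the learned weights, mitigating the zero-gradient sampling problem (Algorithm 2 and the discussion in Section 5).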
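
The ratio-preserving idea can be sketched as follows: filter out prompts that carry no gradient signal first, then fill per-task quotas derived from the target weights, so the batch that actually reaches the optimizer matches those weights. This is a hedged illustration under stated assumptions, not the paper's Algorithm 2. The function `ratio_preserving_batch`, the `(task, prompt_id, has_signal)` tuple layout, the largest-remainder rounding, and the toy pool are all hypothetical.

```python
import random
from collections import defaultdict

def ratio_preserving_batch(pool, weights, batch_size, seed=0):
    """Build a batch whose per-task counts follow `weights` after filtering.

    pool:    list of (task, prompt_id, has_signal) tuples; has_signal is False
             when every sampled response got the same reward (zero advantage).
    weights: dict mapping task -> target fraction (non-negative, sums to 1).
    """
    rng = random.Random(seed)

    # Keep only prompts that would contribute a non-zero gradient.
    usable = defaultdict(list)
    for task, prompt_id, has_signal in pool:
        if has_signal:
            usable[task].append(prompt_id)

    # Turn fractional targets into integer quotas (largest-remainder rounding).
    exact = {t: w * batch_size for t, w in weights.items()}
    quota = {t: int(x) for t, x in exact.items()}
    leftover = batch_size - sum(quota.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - quota[t], reverse=True)
    for t in by_remainder[:leftover]:
        quota[t] += 1

    batch = []
    for task, n in quota.items():
        if len(usable[task]) < n:
            # Placeholder: a real pipeline would draw more rollouts for this
            # task instead of failing.
            raise ValueError(f"not enough usable prompts for task {task!r}")
        batch += [(task, p) for p in rng.sample(usable[task], n)]
    return batch

# Toy pool: each task has 20 prompts but a different zero-gradient rate.
pool = (
    [("countdown", i, i % 2 == 0) for i in range(20)]
    + [("arc", i, i % 4 == 0) for i in range(20)]
    + [("zebra", i, i % 5 == 0) for i in range(20)]
)
weights = {"countdown": 0.25, "arc": 0.25, "zebra": 0.5}
batch = ratio_preserving_batch(pool, weights, batch_size=8)
```

Filtering before applying the quotas is the design choice that matters here: if the quotas were applied first and the zero-signal prompts dropped afterwards, the task with the highest zero-gradient rate would end up under-represented relative to its weight.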

Figure 1: GRPO assigns uniform task weights and samples without regard to task difficulty or zero-gradient rates. Consequently, easy tasks (Countdown) dominate while harder tasks (ARC, Zebra) lag, and effective gradient flow is skewed by varying zero-gradient rates ( $\otimes$ marks high zero-gradie

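Experimental Results

Research Questions

RQ1: Can a robustness-oriented multi-task objective improve worst-task performance without sacrificing average performance?
RQ2: How should task weights be updated to reflect both current performance and each task's improvement trajectory?
RQ3: Given that zero-gradient rates differ across tasks, how can batch construction stay faithful to the target task proportions?
RQ4: Do improvement-aware reweighting and ratio-preserving sampling still scale to larger task sets while keeping every task reliable?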

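Key Findings

Compared with the baselines (GRPO, DAPO, SEC-DAPO), MT-GRPO consistently improves worst-task accuracy in the experiments.
In the 3-task setting, MT-GRPO achieves a 16–28% absolute improvement in worst-task performance over standard GRPO and about 6% over DAPO, with competitive average accuracy.
In the 3-task setting, MT-GRPO reaches 50% worst-task accuracy in roughly half the training steps the baselines need.
The improvement-aware reweighting scheme reduces weight collapse onto a single worst task while still steering optimization toward underperforming tasks.
The ratio-preserving sampler keeps realized batch proportions aligned with the learned task weights, ensuring an effective gradient contribution from each task.
In the 9-task experiments, a larger λ strengthens worst-task improvement but may reduce average performance, reflecting a controllable trade-off.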

Figure 2: In strict worst-task optimization ( $\varepsilon=0$ ), task weights rapidly collapse to the current worst task and oscillate as the worst task shifts, resulting in near-zero weighting of Countdown.
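
The collapse and oscillation described in this caption can be reproduced with a toy calculation. The sketch below contrasts strict worst-task weighting with a softened weighting; the softmax form, the `temperature` value, and the accuracy numbers are assumptions chosen for illustration only and are not the paper's update rule.

```python
import numpy as np

def hard_worst_task_weights(acc):
    """Strict worst-task weighting: all mass on the lowest-accuracy task."""
    w = np.zeros(len(acc))
    w[int(np.argmin(acc))] = 1.0
    return w

def softened_weights(acc, temperature=0.2):
    """Illustrative alternative: softmax over negative accuracy."""
    logits = -np.asarray(acc, dtype=float) / temperature
    logits -= logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Made-up accuracies for (Countdown, ARC, Zebra) at two nearby steps.
acc_step_a = [0.80, 0.30, 0.32]  # ARC is the worst task
acc_step_b = [0.80, 0.33, 0.32]  # Zebra becomes the worst task

hard_a = hard_worst_task_weights(acc_step_a)
hard_b = hard_worst_task_weights(acc_step_b)
soft_a = softened_weights(acc_step_a)
soft_b = softened_weights(acc_step_b)
```

Under the strict rule, a three-point change in one task's accuracy moves the entire weight from ARC to Zebra and leaves Countdown at zero in both steps. The softened weights shift only slightly and keep every task above zero.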


This review was generated by AI and reviewed by a human editor.