QUICK REVIEW

[Paper Review] Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

Shyam Sundhar Ramesh, Xiaotong Ji | arXiv (Cornell University) | Feb 5, 2026

Multimodal Machine Learning Applications | Citations: 0

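One-line summary

MT-GRPO introduces improvement-aware task reweighting and a ratio-preserving sampler to achieve robust, balanced reasoning across multiple tasks during GRPO-based RL post-training. It outperforms baselines on worst-task accuracy while maintaining average performance.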

ABSTRACT

RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.

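Motivation and objectives

Promote balanced capability across diverse reasoning tasks during RL post-training.
Directly optimize worst-task performance while maintaining average performance.
Address the zero-gradient prompts and task interference that arise in naive multi-task GRPO.
Introduce a mechanism that aligns the learned task weights with each task's actual gradient contribution.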
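The zero-gradient issue named above follows from how GRPO normalizes rewards within each prompt's group of rollouts: when every rollout gets the same reward, every advantage is zero and the prompt contributes no gradient. The sketch below illustrates this with made-up binary rewards. The helper names, the population standard deviation, and the `eps` constant are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) over one prompt's rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def zero_gradient_rate(groups):
    """Fraction of prompts whose rollouts all received the same reward."""
    dead = sum(1 for g in groups if max(g) == min(g))
    return dead / len(groups)


# Hypothetical data: 4 prompts per task, 4 rollouts per prompt, binary rewards.
easy_task = [[1, 1, 1, 1], [1, 0, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
hard_task = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]

print(group_advantages([1, 1, 1, 1]))  # all zeros: no learning signal
print(zero_gradient_rate(easy_task), zero_gradient_rate(hard_task))
```

In this toy data the easy task loses three of its four prompts and the hard task loses two, so a batch sampled uniformly over prompts would not deliver uniform gradient contributions per task.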

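Proposed method

Two main ideas: (i) improvement-aware task reweighting that prioritizes weak or slowly improving tasks, and (ii) a ratio-preserving batch construction mechanism that makes the gradients reflect the learned task weights.
A formal objective that balances average performance against cross-task robustness as a constrained max-min problem (Eq. (4)), together with its Lagrangian relaxation (Eq. (5)).
Update rules that couple policy optimization with adaptive task weights: a θ update using z-weighted GRPO gradients (Eq. (6)) and a ξ update that adjusts z according to an improvement signal (Eq. (7)).
The improvement-aware weight update (IWU) stabilizes reweighting by using a combined signal of per-task improvement $I_k^{(t)}$ and task reward (Subroutine 1).
The ratio-preserving (RP) sampler enforces target task ratios that match the learned weights on the batch that remains after post-filtering, mitigating the zero-gradient sampling problem (Algorithm 2 and the discussion in Section 5).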
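To make the reweighting idea concrete, here is a minimal sketch of one way task weights could be shifted toward weak or slowly improving tasks. It is not the paper's Subroutine 1 or Eq. (7), whose exact form is not reproduced in this review. It assumes a multiplicative-weights (exponentiated-gradient) step on the probability simplex, a score of the form -(reward + beta * improvement), and hypothetical parameters `eta` and `beta`.

```python
import math


def update_task_weights(z, rewards, improvements, eta=1.0, beta=1.0):
    """One multiplicative-weights step on the probability simplex.

    Assumed signal (illustrative only): tasks with low reward or small
    recent improvement get a higher score, and therefore more weight.
    """
    scores = [-(r + beta * i) for r, i in zip(rewards, improvements)]
    shift = max(scores)  # subtract the max for numerical stability
    unnorm = [w * math.exp(eta * (s - shift)) for w, s in zip(z, scores)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical values for three tasks, starting from uniform weights.
z = [1 / 3, 1 / 3, 1 / 3]
rewards = [0.8, 0.3, 0.4]          # current per-task accuracy
improvements = [0.05, 0.00, 0.10]  # recent change in accuracy
new_z = update_task_weights(z, rewards, improvements)
print(new_z)
```

With these numbers the second task (lowest reward, no improvement) receives the largest weight and the first task the smallest. Including the improvement term means a weak task that is already improving quickly draws less extra weight than one that has stalled.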

Figure 1: GRPO assigns uniform task weights and samples without regard to task difficulty or zero-gradient rates. Consequently, easy tasks (Countdown) dominate while harder tasks (ARC, Zebra) lag, and effective gradient flow is skewed by varying zero-gradient rates ( $\otimes$ marks high zero-gradie

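Experimental results

Research questions

RQ1: Can a robustness-aware multi-task objective improve worst-task performance without sacrificing average performance?
RQ2: How should task weights be updated so that they reflect both current performance and improvement trends?
RQ3: How can batch composition be kept faithful to the target task ratios given that zero-gradient rates differ across tasks?
RQ4: Do improvement-aware reweighting and ratio-preserving sampling scale to larger task sets while remaining reliable?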

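Key findings

MT-GRPO consistently improves worst-task accuracy over the baselines (GRPO, DAPO, SEC-DAPO) across experiments.
In the 3-task setting, MT-GRPO improves worst-task performance by 16-28% absolute over standard GRPO and by 6% over DAPO, with competitive average accuracy.
In the 3-task setting, MT-GRPO reaches 50% worst-task accuracy in about half the training steps.
The improvement-aware weight update prevents the weights from collapsing onto a single worst task and steers optimization toward low-performing tasks.
The ratio-preserving sampler aligns the actual batch ratios with the learned task weights, ensuring an effective gradient contribution from each task.
Experiments with 9 tasks show that a larger λ strengthens worst-task improvement while potentially lowering average performance, exposing a controllable trade-off.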
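The ratio-preserving idea in the findings above can be sketched as follows: convert the target weights into integer per-task quotas, then keep drawing prompts for each task until its quota is filled with prompts that survive the zero-advantage filter. This is an illustrative sketch, not the paper's Algorithm 2. The function names, the largest-remainder rounding, the `max_draws` cap, and the simulated reward generators are all assumptions.

```python
import random


def quotas(weights, batch_size):
    """Largest-remainder rounding of weights * batch_size (weights sum to 1)."""
    raw = [w * batch_size for w in weights]
    base = [int(x) for x in raw]
    leftover = batch_size - sum(base)
    order = sorted(range(len(raw)), key=lambda k: raw[k] - base[k], reverse=True)
    for k in order[:leftover]:
        base[k] += 1
    return base


def has_signal(group):
    """A prompt carries gradient only if its rollout rewards are not all equal."""
    return max(group) != min(group)


def ratio_preserving_batch(sample_fns, weights, batch_size, max_draws=10_000):
    """Fill each task's quota with prompts that pass the filter."""
    batch = []
    for task, (draw, quota) in enumerate(zip(sample_fns, quotas(weights, batch_size))):
        kept = 0
        for _ in range(max_draws):  # cap draws so a dead task cannot loop forever
            if kept == quota:
                break
            group = draw()
            if has_signal(group):
                batch.append((task, group))
                kept += 1
    return batch


rng = random.Random(0)


def make_task(p_success, n_rollouts=4):
    """Hypothetical task: each rollout succeeds independently with p_success."""
    return lambda: [int(rng.random() < p_success) for _ in range(n_rollouts)]


tasks = [make_task(0.95), make_task(0.5), make_task(0.05)]
batch = ratio_preserving_batch(tasks, weights=[0.25, 0.25, 0.5], batch_size=8)
counts = [sum(1 for t, _ in batch if t == k) for k in range(3)]
print(counts)  # matches the quotas as long as max_draws is not exhausted
```

Filtering first and then accepting whatever ratio remains would under-represent the very easy and very hard tasks here, since most of their prompts have identical rewards. Filling quotas after filtering keeps the surviving batch at the target ratio, at the cost of extra rollouts for tasks with high zero-gradient rates.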

Figure 2: In strict worst-task optimization ( $\varepsilon=0$ ), task weights rapidly collapse to the current worst task and oscillate as the worst task shifts, resulting in near-zero weighting of Countdown.


This review was generated by AI and checked by a human editor.