QUICK REVIEW

[Paper Review] Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards

Hieu Trung Nguyen, Bao Nguyen|arXiv (Cornell University)|Feb 2, 2026

Reinforcement Learning in Robotics | Citations: 0

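One-Sentence Summary

VIP introduces variance-informed rollout allocation: it predicts each prompt's success probability with a Gaussian process, then solves a convex optimization problem that allocates rollouts under a compute budget, improving sampling efficiency in group-based reinforcement learning with verifiable rewards.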

ABSTRACT

Sampling efficiency is a key bottleneck in reinforcement learning with verifiable rewards. Existing group-based policy optimization methods, such as GRPO, allocate a fixed number of rollouts for all training prompts. This uniform allocation implicitly treats all prompts as equally informative, and could lead to inefficient computational budget usage and impede training progress. We introduce VIP, a Variance-Informed Predictive allocation strategy that allocates a given rollout budget to the prompts in the incumbent batch to minimize the expected gradient variance of the policy update. At each iteration, VIP uses a lightweight Gaussian process model to predict per-prompt success probabilities based on recent rollouts. These probability predictions are translated into variance estimates, which are then fed into a convex optimization problem to determine the optimal rollout allocations under a hard compute budget constraint. Empirical results show that VIP consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies in multiple benchmarks.

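Motivation and Objectives

Motivate the need for adaptive rollout allocation in group-based reinforcement learning with verifiable rewards, with the aim of improving sampling efficiency.
Develop a principled method (VIP) that predicts each prompt's success probability and allocates rollouts to minimize gradient variance.
Provide a theoretical analysis linking gradient variance to prompt success probability, and demonstrate empirical gains on reasoning and tool-augmented tasks.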

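Proposed Method

Analyze the gradient variance in Dr. GRPO and RLOO, clarifying how each prompt's variance relates to its success probability p.
Introduce a GP over prompt embeddings that predicts each prompt's p and updates its posterior with observed rewards.
Formulate a convex optimization problem that minimizes the sum of predicted per-prompt variances under a total rollout budget, using a continuous relaxation and a rounding heuristic.
Provide closed-form-style solutions for the continuous allocation (Dr. GRPO and RLOO variants) and propose a greedy rounding procedure that yields integer allocations.
Validate VIP experimentally on mathematical reasoning and tool-augmented reasoning tasks, comparing it against uniform and heuristic allocation.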
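The allocation step can be illustrated with a minimal sketch. It assumes a simplified variance model in which a prompt with predicted success probability p_i contributes p_i(1 - p_i) / n_i when it receives n_i rollouts. Under that assumption, the continuous relaxation has the classical Neyman-style solution n_i ∝ sqrt(p_i(1 - p_i)). The paper's actual Dr. GRPO and RLOO variance expressions are not reproduced here, and the names `allocate_rollouts`, `n_min`, and `eps` are hypothetical.

```python
import heapq
import math


def bernoulli_std(p, eps=1e-3):
    """Std. dev. of a 0/1 reward; p is clamped (an assumption) so it is never zero."""
    p = min(max(p, eps), 1.0 - eps)
    return math.sqrt(p * (1.0 - p))


def allocate_rollouts(probs, budget, n_min=2):
    """Split `budget` rollouts across prompts to reduce sum_i s_i^2 / n_i.

    Simplified stand-in for a variance-informed allocation:
    1. give every prompt n_min rollouts,
    2. share the spare budget proportionally to s_i (continuous relaxation),
    3. floor, then hand out leftover rollouts greedily by marginal gain.
    """
    k = len(probs)
    if budget < k * n_min:
        raise ValueError("budget too small for n_min rollouts per prompt")

    s = [bernoulli_std(p) for p in probs]
    spare = budget - k * n_min
    total = sum(s)

    # Continuous relaxation: spare rollouts proportional to s_i.
    alloc = [n_min + math.floor(spare * si / total) for si in s]

    # Greedy rounding: adding one rollout to prompt i lowers the objective
    # by s_i^2 / (n_i * (n_i + 1)); always pick the largest reduction.
    heap = [(-(si ** 2) / (n * (n + 1)), i) for i, (si, n) in enumerate(zip(s, alloc))]
    heapq.heapify(heap)
    for _ in range(budget - sum(alloc)):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        n = alloc[i]
        heapq.heappush(heap, (-(s[i] ** 2) / (n * (n + 1)), i))
    return alloc
```

For example, `allocate_rollouts([0.5, 0.9, 0.1, 0.99], budget=32)` gives the most rollouts to the prompt with p = 0.5, where the outcome is least certain, and the fewest to the prompt the model almost always solves, while spending exactly the budget.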

Figure 1: The process starts with an initial belief over prompt success probabilities. At each step $t$, a mini-batch $\mathcal{B}_{t}$ is selected, and the belief function $m_{t}(\cdot)$ predicts the success probabilities of the prompts in $\mathcal{B}_{t}$. A budget allocation module assigns rol

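Experimental Results

Research Questions

RQ1: How does the gradient variance of on-policy rollouts depend on each prompt's success probability in the GRPO/RLOO setting?
RQ2: Can a GP-based predictor of per-prompt success probability guide rollout allocation so as to minimize gradient variance under a fixed compute budget?
RQ3: Does adaptive allocation improve sampling efficiency and final performance on reasoning benchmarks and tool-augmented tasks?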
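To make the GP-based predictor in RQ2 concrete, here is a dependency-free sketch of GP regression over prompt embeddings. Everything beyond "a GP predicts per-prompt success probability" is an assumption made for illustration: the RBF kernel, the Gaussian noise model on observed success rates, the prior mean of 0.5, the clipping to [0, 1], and the hypothetical names `rbf`, `solve`, and `predict_success`. The paper's actual model and update rule may differ.

```python
import math


def rbf(x, y, length_scale=1.0):
    """Squared-exponential kernel between two embedding vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def predict_success(train_x, train_rate, query_x,
                    prior=0.5, noise=0.05, length_scale=1.0):
    """GP posterior mean of the success rate at each query embedding.

    train_rate holds empirical success rates from recent rollouts.
    The result is clipped to [0, 1] so it can be read as a probability.
    """
    K = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(train_x)]
         for i, a in enumerate(train_x)]
    alpha = solve(K, [r - prior for r in train_rate])
    preds = []
    for q in query_x:
        mean = prior + sum(rbf(q, x, length_scale) * a
                           for x, a in zip(train_x, alpha))
        preds.append(min(1.0, max(0.0, mean)))
    return preds
```

Prompts whose embeddings lie close to previously seen prompts inherit those observed success rates, while prompts far from all observations fall back to the prior. The predicted probabilities can then be converted into variance estimates for the allocation step.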

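Key Findings

VIP consistently improves sampling efficiency and performance over uniform and heuristic allocation across multiple benchmarks.
On AIME-style reasoning tasks, VIP shows notable improvements in Pass@32, Mean@32, and Maj@32 across models and budgets.
The continuous-allocation formulas yield a budget-aware, efficient rollout distribution, and rounding turns it into a feasible integer allocation.
VIP's gains are more pronounced for smaller and base models, suggesting it is especially effective when the base model does not use its rollout budget efficiently.
The empirical results extend to tool-augmented reasoning tasks, indicating that adaptive allocation also helps in retrieval-augmented generation.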
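The review does not define the three metrics, so the sketch below assumes their common usage: with k sampled answers per problem, Pass@k is 1 if any sample is correct, Mean@k is the fraction of correct samples, and Maj@k is 1 if the most frequent answer is correct. The function name `score_problem` and the exact-match comparison are hypothetical simplifications.

```python
from collections import Counter


def score_problem(answers, reference):
    """Score k sampled answers for one problem against a reference answer.

    Assumed definitions (not taken from the paper):
      pass: 1.0 if at least one sample matches the reference
      mean: fraction of samples that match the reference
      maj:  1.0 if the most frequent answer matches the reference
    """
    correct = [a == reference for a in answers]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return {
        "pass": float(any(correct)),
        "mean": sum(correct) / len(correct),
        "maj": float(majority_answer == reference),
    }
```

Averaging each entry over all problems in a benchmark with k = 32 samples gives dataset-level Pass@32, Mean@32, and Maj@32 under these assumptions.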


This review was generated by AI and checked by a human editor.