QUICK REVIEW

[Paper Review] Transform-Augmented GRPO Improves Pass@k

Khiem Le, Youssef Mroueh|arXiv (Cornell University)|Jan 30, 2026

Topic Modeling | Cited by 0

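One-Sentence Summary

TA-GRPO trains on semantically equivalent transformed variants of each question and pools rewards within the resulting group, which reduces gradient diminishing and improves Pass@k on mathematical and scientific reasoning benchmarks, especially at higher k.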

ABSTRACT

Large language models trained via next-token prediction are fundamentally pattern-matchers: sensitive to superficial phrasing variations even when the underlying problem is identical. Group Relative Policy Optimization (GRPO) was designed to improve reasoning, but in fact it worsens this situation through two failure modes: diversity collapse, where training amplifies a single solution strategy while depriving alternatives of gradient signal, and gradient diminishing, where a large portion of questions yield zero gradients because all rollouts receive identical rewards. We propose TA-GRPO (Transform-Augmented GRPO), which generates semantically equivalent transformed variants of each question (via paraphrasing, variable renaming, and format changes) and computes advantages by pooling rewards across the entire group. This pooled computation ensures mixed rewards even when the original question is too easy or too hard, while training on diverse phrasings promotes multiple solution strategies. We provide theoretical justification showing that TA-GRPO reduces zero-gradient probability and improves generalization via reduced train-test distribution shift. Experiments on mathematical reasoning benchmarks show consistent Pass@k improvements, with gains up to 9.84 points on competition math (AMC12, AIME24) and 5.05 points on out-of-distribution scientific reasoning (GPQA-Diamond).

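Motivation and Objectives

Clarify the limits of pattern matching in reasoning tasks and how they affect LLMs.
Address gradient diminishing and diversity collapse in GRPO.
Introduce TA-GRPO, which pools advantages across the transformed variants of each question.
Provide theoretical justification for a smaller train-test gap and for non-zero gradients.
Empirically validate Pass@k improvements on mathematical and scientific reasoning benchmarks.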

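Proposed Method

Generate N semantics-preserving transformations of each question (paraphrasing, variable renaming, format changes).
Group each question with its transformed variants and pool advantages across the group.
Apply pooled whitening, A = (R - mu_group) / (sigma_group + epsilon), with the mean and standard deviation taken over all rollouts of all (N+1) variants.
Starting from Bernoulli variance and a Pinsker-KL argument, show theoretically that the zero-gradient probability decreases and obtain a generalization bound; derive the pooled objective.
Evaluate empirically with Qwen3-1.7B and Qwen3-4B on AMC12, AIME24, AIME25, OlympiadBench, Minerva, and GPQA-Diamond.
Show that pooling is the key ingredient: in ablations, data augmentation alone, without pooled advantages, is not enough to outperform GRPO.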
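
The pooled whitening step above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the reward values, the use of the population standard deviation, and the value of `eps` are all assumptions made for the example. It contrasts whitening each variant separately with whitening over the pooled group, for an original question on which every rollout is correct.

```python
from statistics import fmean, pstdev


def whiten(rewards, eps=1e-6):
    """Standardize one list of rewards: (R - mean) / (std + eps)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_variant_advantages(groups, eps=1e-6):
    """Baseline for comparison: whiten each variant's rollouts on their own."""
    return [whiten(g, eps) for g in groups]


def pooled_advantages(groups, eps=1e-6):
    """Whiten with one mean and std computed over all variants and rollouts."""
    flat = [r for g in groups for r in g]
    mu = fmean(flat)
    sigma = pstdev(flat)
    return [[(r - mu) / (sigma + eps) for r in g] for g in groups]


# Hypothetical binary rewards (1 = correct, 0 = incorrect), 4 rollouts each.
groups = [
    [1.0, 1.0, 1.0, 1.0],  # original question: "too easy", all correct
    [1.0, 0.0, 1.0, 0.0],  # transformed variant 1 (e.g. a paraphrase)
    [0.0, 0.0, 1.0, 0.0],  # transformed variant 2 (e.g. renamed variables)
]

separate = per_variant_advantages(groups)
pooled = pooled_advantages(groups)

print("original question, separate:", separate[0])  # all zeros, no signal
print("original question, pooled:  ", pooled[0])    # positive advantages
```

Under these assumptions, the original question contributes zero advantage when whitened alone, but receives a positive advantage once its rewards are compared against the pooled mean, which is the effect the summary attributes to pooling.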

Figure 1: Percentage of zero-gradient questions throughout training (Qwen3-1.7B). Questions that are “too easy” (all rollouts correct) or “too hard” (all rollouts incorrect) yield zero gradients and contribute nothing to learning: lower is better. TA-GRPO consistently reduces zero-gradient question

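Experimental Results

Research Questions

RQ1: Does TA-GRPO improve Pass@k over standard GRPO at larger k?
RQ2: Does TA-GRPO generalize better to out-of-distribution reasoning tasks?
RQ3: Does transform augmentation reduce gradient diminishing and encourage a richer set of solution strategies?
RQ4: Is the pooled-advantage objective theoretically grounded and beneficial in practice?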

Key Findings

Model	AMC12	AIME24	AIME25	OlympiadBench	Minerva	GPQA-Diamond
Qwen3-1.7B Base	65.06	30.00	30.00	60.09	48.53	57.58
Qwen3-1.7B GRPO	69.88	41.31	30.00	66.62	50.37	68.69
Qwen3-1.7B TA-GRPO	79.72	50.00	33.33	68.84	52.94	73.74
Qwen3-1.7B gain (TA-GRPO vs. GRPO)	+9.84	+8.69	+3.33	+2.23	+2.57	+5.05
Qwen3-4B Base	73.49	43.33	33.33	65.88	59.19	78.79
Qwen3-4B GRPO	84.34	60.00	46.67	70.33	59.56	78.79
Qwen3-4B TA-GRPO	87.95	66.67	53.33	75.07	61.03	82.32
Qwen3-4B gain (TA-GRPO vs. GRPO)	+3.62	+6.67	+6.67	+4.75	+1.47	+3.54

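TA-GRPO yields consistent Pass@k improvements across benchmarks; on the 1.7B model the gain over GRPO reaches 9.84 points on AMC12 and 5.05 points on GPQA-Diamond.
For the 4B model, TA-GRPO improves on GRPO at Pass@32 by up to 6.67 points (AIME24 and AIME25) and by 3.54 points on GPQA-Diamond.
TA-GRPO reduces the share of zero-gradient questions during training by 12–16 percentage points.
Ablations show that pooling advantages is the key ingredient; data augmentation alone, without pooled advantages, often falls short of GRPO on some benchmarks.
TA-GRPO generalizes better to out-of-distribution tasks, with GPQA-Diamond gains of 5.05 points (1.7B) and 3.54 points (4B).
The diversity of the transformed variants preserves multiple solution strategies, which explains the larger gains at higher k.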
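
The zero-gradient finding above can be made concrete with a simplified Bernoulli model. The sketch below is an illustration under stated assumptions, not the paper's proof: rewards are binary, rollouts are independent, each variant `i` has its own success rate `p_i`, and every variant gets `G` rollouts. The function names and the example success rates are hypothetical.

```python
def zero_grad_prob_single(p, G):
    """P(all G rollouts of one question get the same binary reward)."""
    return p**G + (1 - p) ** G


def zero_grad_prob_pooled(ps, G):
    """P(every rollout of every variant in the pooled group agrees).

    ps holds one success rate per variant, original question included.
    """
    all_correct = 1.0
    all_wrong = 1.0
    for p in ps:
        all_correct *= p**G
        all_wrong *= (1 - p) ** G
    return all_correct + all_wrong


# Hypothetical numbers: an easy original question (p = 0.95) whose two
# transformed variants are somewhat harder for the model.
G = 8
single = zero_grad_prob_single(0.95, G)
pooled = zero_grad_prob_pooled([0.95, 0.8, 0.7], G)

print(f"zero-gradient probability, original only: {single:.4f}")
print(f"zero-gradient probability, pooled group:  {pooled:.4f}")
```

In this model the pooled probability can never exceed the single-question probability, because each product only gains factors that are at most 1. How large the reduction is in practice depends on how much the variants' success rates differ, which the model above does not predict.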

Figure 2: Pass@k curves across all benchmarks for Qwen3-1.7B (top) and Qwen3-4B (bottom). TA-GRPO (red) consistently outperforms GRPO (blue), with the gap widening as $k$ increases.
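
This summary does not say how Pass@k is computed. One widely used choice is the unbiased estimator introduced with the HumanEval benchmark, 1 - C(n - c, k) / C(n, k), where n is the number of sampled solutions for a question and c is the number that are correct. The sketch below assumes that estimator; the function names and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate for one question.

    n: total sampled solutions, c: number correct, k: attempt budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # in which case any k-subset contains a correct solution.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average Pass@k over questions, as a percentage.

    results: list of (n, c) pairs, one per question.
    """
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical per-question counts: (samples drawn, samples correct).
results = [(32, 0), (32, 1), (32, 16)]
for k in (1, 8, 32):
    print(f"Pass@{k}: {benchmark_pass_at_k(results, k):.2f}")
```

A question with a single correct sample out of 32 contributes little at k = 1 but counts fully at k = 32, which is why methods that keep rare solution strategies alive tend to show their advantage at larger k.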

This review was generated by AI and reviewed by a human editor.