QUICK REVIEW

[Paper Review] Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Aozhe Wang, Yuchen Yan|arXiv (Cornell University)|Mar 16, 2026
Software Testing and Debugging Techniques|Cited by 0
One-Sentence Summary

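Code-A1 introduces adversarial co-evolution of a Code LLM and a Test LLM. It combines white-box adversarial test generation with a Mistake Book to keep reinforcement learning training stable, and it reaches strong code generation and test generation performance without human-annotated tests.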

ABSTRACT

Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.
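
To make the opposing objectives concrete, here is a minimal Python sketch of an execution-based reward signal. It is an illustration under stated assumptions, not the paper's implementation: the `(args, expected)` test representation, the helper names `passes`, `validate_tests`, and `pass_rate`, and the toy `abs` problem are all hypothetical. Generated tests are first checked against a reference solution, then the surviving tests are run against a candidate.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: a test is (positional arguments, expected output).
TestCase = Tuple[tuple, object]
Solution = Callable[..., object]


def passes(solution: Solution, test: TestCase) -> bool:
    """True if the solution returns the expected output without raising."""
    args, expected = test
    try:
        return solution(*args) == expected
    except Exception:
        return False


def validate_tests(reference: Solution, tests: List[TestCase]) -> List[TestCase]:
    """Keep only the tests that the reference (ground-truth) solution passes."""
    return [t for t in tests if passes(reference, t)]


def pass_rate(candidate: Solution, tests: List[TestCase]) -> float:
    """Fraction of tests passed; returning 0.0 for an empty list is an assumption."""
    if not tests:
        return 0.0
    return sum(passes(candidate, t) for t in tests) / len(tests)


# Toy problem: absolute value.
def reference_abs(x):
    return x if x >= 0 else -x


def buggy_abs(x):
    return x  # wrong for negative inputs


# The last generated test is invalid: the reference solution fails it.
generated = [((3,), 3), ((-2,), 2), ((0,), 0), ((-1,), -1)]
valid = validate_tests(reference_abs, generated)

code_signal = pass_rate(buggy_abs, valid)  # what the Code LLM wants to raise
test_signal = 1.0 - code_signal            # what the Test LLM wants to raise
```

In this toy run the invalid test is filtered out, and the single valid test on a negative input is the one that exposes the bug, which is the kind of targeted test that white-box access is meant to encourage.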

Research Motivation and Goals

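  • Motivate better verifiable rewards for code generation that go beyond scarce, static test suites.
  • Propose a decoupled adversarial framework in which the Code LLM and the Test LLM have opposing objectives, avoiding self-collusion.
  • Introduce a Mistake Book to stabilize experience replay and track historical failures.
  • Design a composite reward that balances test validity against adversarial difficulty.
  • Show that adversarial co-evolution matches or exceeds RL with human-annotated tests in evaluation, while also producing better tests.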

Proposed Method

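  • Decoupled two-model architecture: the Code LLM generates candidate solutions, and the Test LLM generates challenging tests conditioned on that code.
  • Adversarial rollout: generate several test sets for each candidate solution, validate the tests against the ground-truth solution, and execute the code on the validated tests to obtain a pass rate.
  • Mistake Book: per-problem experience replay that stores historically failed tests, stabilizing rewards and tracking how capability evolves.
  • Composite reward: the Code LLM's reward combines the pass rate on historical tests with the pass rate on new tests to encourage robustness; the Test LLM's reward balances test validity against adversarial difficulty through a weighted sum.
  • Policy optimization with Group Relative Policy Optimization (GRPO), using asymmetric sampling and top-variance selection to balance training compute.
  • The training dynamics form a curriculum-like co-evolution: stronger code demands harder tests and vice versa, pushing both models to improve together.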
Figure 1 : Comparison of three training paradigms for code generation. Vanilla GRPO relies on static golden tests with rigid rewards. GRPO with Self-Play unifies code and test generation but must operate in black-box mode to prevent self-collusion. Code-A1 decouples the two tasks into models with op
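
For readers unfamiliar with GRPO, the group-relative advantage it relies on can be sketched in a few lines of Python. The normalization below (reward minus group mean, divided by group standard deviation) is the usual GRPO formulation; the use of the population standard deviation and the `eps` constant are choices made for this sketch. The `top_variance_groups` helper is only a guess at what "top-variance selection" could mean, namely keeping the rollout groups whose rewards vary most, and it is not confirmed by the source.

```python
from statistics import mean, pstdev, pvariance
from typing import Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def top_variance_groups(groups: Dict[str, List[float]], k: int) -> List[str]:
    """Assumed selection rule: keep the k groups with the highest reward variance."""
    ranked = sorted(groups, key=lambda g: pvariance(groups[g]), reverse=True)
    return ranked[:k]


# Hypothetical reward groups, one per candidate solution.
groups = {
    "code_a": [1.0, 1.0, 1.0, 1.0],    # no variance, so no learning signal
    "code_b": [0.0, 1.0, 0.0, 1.0],
    "code_c": [0.5, 0.75, 0.5, 0.75],
}
selected = top_variance_groups(groups, k=2)
advantages = group_relative_advantages(groups["code_b"])
```

A group with identical rewards yields all-zero advantages and therefore contributes no gradient, which is why spending compute only on high-variance groups is a natural way to balance training cost.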

Experimental Results

Research Questions

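  • RQ1: How can white-box adversarial test generation be enabled safely, without self-collusion?
  • RQ2: Across multiple code benchmarks, can adversarial co-evolution of the Code LLM and the Test LLM outperform RL based on human-annotated tests?
  • RQ3: Does the Mistake Book stabilize training and prevent catastrophic forgetting during adversarial co-evolution?
  • RQ4: How does balancing test validity against adversarial difficulty affect learning dynamics and final performance?
  • RQ5: Does Code-A1's test generation quality match or exceed that of conventional RL with golden tests?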

Key Findings

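  • Code-A1 matches or exceeds models trained on human-annotated tests on code generation benchmarks such as HumanEval+, MBPP+, and BigCodeBench.
  • Code-A1's Test LLM surpasses supervised fine-tuning and self-play baselines at generating discriminative tests.
  • The 3B model reaches a Mul score of 15.29, exceeding the base 7B model's 14.72, which points to efficiency gains from adversarial co-evolution.
  • The Mistake Book stabilizes training and exposes defects progressively: historical failures guide reward computation and prevent forgetting.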
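  • Removing white-box access, the Mistake Book, or the answer-prediction requirement each degrades performance, confirming the contribution of every component.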
Figure 2 : Overview of the Code-A1 training framework. The Code LLM generates solutions accessible to the Test LLM for white-box testing. Generated tests are validated and merged with historical tests from the Mistake Book. The Code LLM is rewarded for passing more tests; the Test LLM is rewarded fo


This review was generated by AI and reviewed by a human editor.