QUICK REVIEW

[Paper Review] Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Aozhe Wang, Yuchen Yan | arXiv (Cornell University) | Mar 16, 2026

Software Testing and Debugging Techniques | Cited by 0

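One-Sentence Summary

Code-A1 adversarially co-evolves a Code LLM and a Test LLM, combining white-box adversarial test generation with a Mistake Book to keep reinforcement learning stable, and reaches strong code generation and test generation performance without human-annotated tests.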

ABSTRACT

Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.

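Motivation and Objectives

Improve verifiable rewards for code generation beyond scarce, static test suites.
Propose a decoupled adversarial framework in which the Code LLM and the Test LLM have opposing objectives, avoiding self-collusion.
Introduce a Mistake Book for experience replay that stabilizes training and tracks historical failures.
Design a composite reward that balances test validity against adversarial difficulty.
Show that adversarial co-evolution matches or exceeds RL on human-annotated tests in code generation, while also yielding a stronger test generator.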

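Proposed Method

Decoupled dual-model architecture: the Code LLM generates candidate solutions, and the Test LLM generates challenging tests conditioned on that code.
Adversarial rollout: for each candidate solution, generate multiple test suites, validate the tests against the ground-truth solution, and execute the code on the validated tests to obtain a pass rate.
Mistake Book: per-problem experience replay that stores tests failed in the past, stabilizing rewards and tracking how capability evolves.
Composite reward: the Code LLM's reward combines its pass rate on historical tests with its pass rate on new tests to encourage robustness; the Test LLM's reward is a weighted sum that balances test validity against adversarial difficulty.
Policy optimization: Group Relative Policy Optimization (GRPO) with asymmetric sampling and top-variance selection to balance training compute.
Training dynamics form a curriculum-like co-evolution: stronger code demands harder tests and vice versa, driving both models to improve together.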
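
The reward flow in the method list can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: tests are modeled as predicates over a solution function, `alpha` and `beta` are hypothetical mixing weights, adversarial difficulty is assumed to be the candidate's failure rate on validated tests, and `group_advantages` shows only the standard group-relative normalization used in GRPO, without the asymmetric sampling or top-variance selection.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, Dict, List, Tuple

# Hypothetical simplification: a solution maps int -> int,
# and a test is a predicate over a solution.
Solution = Callable[[int], int]
Test = Callable[[Solution], bool]


def run_test(test: Test, solution: Solution) -> bool:
    """Run one test; any exception counts as a failure."""
    try:
        return bool(test(solution))
    except Exception:
        return False


def pass_rate(tests: List[Test], solution: Solution) -> float:
    """Fraction of tests passed; an empty suite counts as fully passed."""
    if not tests:
        return 1.0
    return sum(run_test(t, solution) for t in tests) / len(tests)


@dataclass
class MistakeBook:
    """Per-problem store of validated tests that a past candidate failed."""
    failed: Dict[str, List[Test]] = field(default_factory=dict)

    def history(self, problem_id: str) -> List[Test]:
        return self.failed.get(problem_id, [])

    def record(self, problem_id: str, tests: List[Test], solution: Solution) -> None:
        book = self.failed.setdefault(problem_id, [])
        for t in tests:
            if not run_test(t, solution) and t not in book:
                book.append(t)


def score_rollout(
    problem_id: str,
    candidate: Solution,
    ground_truth: Solution,
    new_tests: List[Test],
    book: MistakeBook,
    alpha: float = 0.5,  # assumed weight on historical tests (Code LLM reward)
    beta: float = 0.5,   # assumed weight on validity (Test LLM reward)
) -> Tuple[float, float]:
    """Return (code_reward, test_reward) for one candidate and one test suite."""
    # A generated test is valid only if the ground-truth solution passes it.
    valid = [t for t in new_tests if run_test(t, ground_truth)]
    validity = len(valid) / len(new_tests) if new_tests else 0.0

    new_rate = pass_rate(valid, candidate)
    hist_rate = pass_rate(book.history(problem_id), candidate)

    # Code LLM: rewarded for passing both historical and new tests.
    code_reward = alpha * hist_rate + (1 - alpha) * new_rate
    # Test LLM: weighted sum of validity and assumed difficulty (failure rate).
    difficulty = 1.0 - new_rate if valid else 0.0
    test_reward = beta * validity + (1 - beta) * difficulty

    # Keep the newly failed, validated tests for later rounds.
    book.record(problem_id, valid, candidate)
    return code_reward, test_reward


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: rewards standardized within one sampled group."""
    m, s = mean(rewards), pstdev(rewards)
    return [(r - m) / (s + eps) for r in rewards]
```

For instance, with `abs` as the ground truth and the identity function as a buggy candidate, a test on a negative input is valid and failing. In this sketch it raises the Test LLM's reward, enters the Mistake Book, and then lowers the Code LLM's reward in later rounds until the bug is fixed.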

Figure 1 : Comparison of three training paradigms for code generation. Vanilla GRPO relies on static golden tests with rigid rewards. GRPO with Self-Play unifies code and test generation but must operate in black-box mode to prevent self-collusion. Code-A1 decouples the two tasks into models with op

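Experimental Results

Research Questions

RQ1: How can white-box adversarial test generation be enabled safely, without self-collusion?
RQ2: Across multiple code benchmarks, can adversarial co-evolution of the Code LLM and the Test LLM outperform RL trained on human-annotated tests?
RQ3: Does the Mistake Book stabilize training and prevent catastrophic forgetting during adversarial co-evolution?
RQ4: How does the balance between test validity and adversarial difficulty affect learning dynamics and final performance?
RQ5: Can Code-A1's test generation quality match or exceed that of conventional RL with golden tests?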

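Key Findings

Code-A1 matches or exceeds models trained on human-annotated tests on code generation benchmarks such as HumanEval+, MBPP+, and BigCodeBench.
Code-A1's Test LLM surpasses supervised fine-tuning and self-play baselines in generating discriminative tests.
The 3B model reaches a Mul score of 15.29, surpassing the base 7B model's 14.72, which suggests an efficiency gain from adversarial co-evolution.
The Mistake Book stabilizes training and exposes defects progressively; historical failures guide reward computation and prevent forgetting.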
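Removing white-box access, the Mistake Book, or the answer-prediction requirement each degrades performance, confirming the contribution of each component.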

Figure 2 : Overview of the Code-A1 training framework. The Code LLM generates solutions accessible to the Test LLM for white-box testing. Generated tests are validated and merged with historical tests from the Mistake Book. The Code LLM is rewarded for passing more tests; the Test LLM is rewarded fo

This review was generated by AI and checked by a human editor.