QUICK REVIEW

[Paper Review] AVO: Agentic Variation Operators for Autonomous Evolutionary Search

Terry Chen, Zhifan Ye | arXiv (Cornell University) | Mar 25, 2026

Evolutionary Algorithms and Applications | Cited by 0

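One-Sentence Summary

AVO replaces fixed mutation operators with autonomous coding agents that plan, implement, and validate kernel optimizations, achieving state-of-the-art attention performance on NVIDIA Blackwell GPUs and transferring those gains to grouped-query attention.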

ABSTRACT

Agentic Variation Operators (AVO) are a new family of evolutionary variation operators that replace the fixed mutation, crossover, and hand-designed heuristics of classical evolutionary search with autonomous coding agents. Rather than confining a language model to candidate generation within a prescribed pipeline, AVO instantiates variation as a self-directed agent loop that can consult the current lineage, a domain-specific knowledge base, and execution feedback to propose, repair, critique, and verify implementation edits. We evaluate AVO on attention, among the most aggressively optimized kernel targets in AI, on NVIDIA Blackwell (B200) GPUs. Over 7 days of continuous autonomous evolution on multi-head attention, AVO discovers kernels that outperform cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5% across the evaluated configurations. The discovered optimizations transfer readily to grouped-query attention, requiring only 30 minutes of additional autonomous adaptation and yielding gains of up to 7.0% over cuDNN and 9.3% over FlashAttention-4. Together, these results show that agentic variation operators move beyond prior LLM-in-the-loop evolutionary pipelines by elevating the agent from candidate generator to variation operator, and can discover performance-critical micro-architectural optimizations that produce kernels surpassing state-of-the-art expert-engineered attention implementations on today's most advanced GPU hardware.

Motivation and Objectives

Motivate autonomous optimization of highly tuned kernels beyond fixed mutation pipelines.
Introduce Agentic Variation Operators (AVO) that enable a self-directed agent to plan, implement, and validate kernel edits.
Evaluate AVO on multi-head attention (MHA) kernels on NVIDIA Blackwell GPUs, comparing against cuDNN and FlashAttention-4 (FA4).
Demonstrate transfer of discovered optimizations from MHA to grouped-query attention (GQA).

Proposed Method

Formalize Vary as an autonomous agent loop that combines planning, tool use, and persistent memory.
Provide a domain-specific knowledge base and a dual-objective scoring function (correctness and throughput).
Implement AVO as a single autonomous variation step that can consult references, test changes, and revise strategies.
Run continuous, multi-day evolution with a self-supervision mechanism to escape stagnation.
Benchmark against cuDNN and FA4 on forward-pass throughput across multiple sequence lengths and configurations.
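
The method points above can be illustrated with a toy sketch. It is not the paper's implementation: the `Candidate` type, the `tile` parameter, the `evaluate` scoring rule, and the `explore` flag are all hypothetical stand-ins. The sketch only shows the shape of the idea, in which one variation step is itself a loop that plans an edit, applies it, validates it against a dual objective (correctness first, then throughput), and revises on failure, while an outer loop switches strategy after repeated stagnation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    tile: int      # hypothetical tunable standing in for a real kernel edit
    version: int   # position in the lineage


def evaluate(c: Candidate) -> tuple[bool, float]:
    """Toy dual-objective score: (is_correct, throughput). Purely illustrative."""
    correct = c.tile % 8 == 0 and 8 <= c.tile <= 256
    throughput = 1000.0 - abs(c.tile - 128) if correct else 0.0
    return correct, throughput


def agent_vary(lineage, rng, max_attempts=5, explore=False):
    """Stand-in for one agentic variation step: plan, implement, validate, revise."""
    parent = lineage[-1]
    _, parent_score = evaluate(parent)
    step = 64 if explore else 8  # assumed stagnation response: try larger edits
    for _ in range(max_attempts):
        delta = rng.choice([-step, step])                            # "plan"
        child = Candidate(parent.tile + delta, parent.version + 1)   # "implement"
        ok, score = evaluate(child)                                  # "validate"
        if ok and score > parent_score:
            return child
        # otherwise "revise": loop again with a new proposal
    return None


def evolve(seed, steps, patience=3, rng_seed=0):
    """Outer loop: commit only validated improvements; track stagnation."""
    rng = random.Random(rng_seed)
    lineage = [seed]
    stalled = 0
    for _ in range(steps):
        child = agent_vary(lineage, rng, explore=stalled >= patience)
        if child is None:
            stalled += 1
        else:
            lineage.append(child)
            stalled = 0
    return lineage


if __name__ == "__main__":
    history = evolve(Candidate(tile=32, version=0), steps=40)
    for c in history:
        print(c.version, c.tile, evaluate(c)[1])
```

In this toy version the "agent" is a random proposal rule; in the setting the paper describes, that role would be played by a coding agent with access to the lineage, a knowledge base, and execution feedback. Only validated, strictly better candidates are committed to the lineage.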

Experimental Results

Research Questions

RQ1: Can agentic variation operators autonomously discover kernel optimizations that surpass hand-crafted baselines (cuDNN, FA4) on modern GPUs?
RQ2: Do optimizations discovered for MHA transfer to GQA with minimal autonomous adaptation?
RQ3: What kinds of micro-architectural strategies (e.g., scheduling, register allocation) do autonomous agents converge on for attention kernels?
RQ4: How does continuous autonomous evolution compare to fixed pipelines in producing sustained performance gains?

Key Findings

Optimization	Versions	Non-causal	Causal
Branchless accumulator rescaling	v19 → v20	+8.1%	+1.6%
Correction/MMA pipeline overlap	v29 → v30	+1.1%	+0.4%
Register rebalancing across warp groups	v32 → v33	+2.1%	~0%
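
As a rough reading aid for the table above, the per-step gains can be compounded multiplicatively. This is a sketch under two assumptions that the summary does not confirm: each percentage is a relative gain over the immediately preceding version, and "~0%" is treated as exactly 0. Because the three steps are not consecutive versions, the result is only their combined contribution, not the total gain of the full evolution.

```python
# Per-step gains (percent) copied from the table: (non-causal, causal).
# Treating "~0%" as 0.0 is an assumption made for this illustration.
steps = {
    "Branchless accumulator rescaling": (8.1, 1.6),
    "Correction/MMA pipeline overlap": (1.1, 0.4),
    "Register rebalancing across warp groups": (2.1, 0.0),
}


def compound(percent_gains):
    """Combine relative gains multiplicatively and return the total in percent."""
    factor = 1.0
    for g in percent_gains:
        factor *= 1.0 + g / 100.0
    return (factor - 1.0) * 100.0


non_causal = compound(g[0] for g in steps.values())
causal = compound(g[1] for g in steps.values())
print(f"non-causal: +{non_causal:.1f}%  causal: +{causal:.1f}%")
# prints "non-causal: +11.6%  causal: +2.0%"
```

Under these assumptions the three listed steps together account for roughly 11.6% on non-causal attention and about 2.0% on causal attention.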

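AVO-generated MHA kernels reach up to 1668 TFLOPS in BF16, outperforming cuDNN by up to 3.5% and FA4 by up to 10.5%.
Discrete evolutionary steps deliver large gains at architectural inflection points, while later stages yield smaller, cumulative improvements.
Transferring the evolved MHA optimizations to GQA takes only about 30 minutes of autonomous adaptation and yields gains of up to 7.0% over cuDNN and up to 9.3% over FA4.
The autonomous optimizations span register allocation, instruction scheduling, and workload distribution, indicating hardware-level reasoning rather than superficial edits.
The seven-day, 40-commit evolution shows sustained progress, with meaningful jumps followed by diminishing returns.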

This review was generated by AI and reviewed by human editors.