QUICK REVIEW

[Paper Digest] A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

Qiuyi Qu, Yicheng Sui | arXiv (Cornell University) | Jan 19, 2026

Parallel Computing and Optimization Techniques | Cited by 0

ONE-SENTENCE SUMMARY

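This paper proposes a two-level GPU code tuner that first refactors kernels into parameterizable templates and then runs a guided search under hardware constraints, optimizing performance while preserving correctness; it improves on Astra and achieves speedups of up to 3.55× on kernels from SGLang.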

ABSTRACT

GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving near-hardware-limit performance still relies heavily on manual code refactoring and parameter tuning. Recent progress in LLM-agent-based kernel generation and optimization has been reported, yet many approaches primarily focus on direct code rewriting, where parameter choices are often implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and template parameters are then optimized via search-based autotuning, yielding more stable and higher-quality speedups. Experiments on a set of real-world kernels demonstrate speedups exceeding 3x in the best case. We extract representative CUDA kernels from SGLang as evaluation targets; the proposed agentic tuner iteratively performs templating, testing, analysis, and planning, and leverages profiling feedback to execute constrained parameter search under hardware resource limits. Compared to agent-only direct rewriting, the template-plus-search design significantly reduces the randomness of iterative optimization, making the process more interpretable and enabling a more systematic approach toward high-performance configurations. The proposed method can be further extended to OpenCL, HIP, and other backends to deliver automated performance optimization for real production workloads.

RESEARCH MOTIVATION AND GOALS

Motivate the need for automated, reproducible GPU kernel optimization under hardware constraints.
Propose a two-level optimization workflow that couples semantics-preserving refactoring with parameterized search.
Ensure correctness while maximizing performance through a constrained, iterative multi-agent loop.
Evaluate the approach on real-world CUDA kernels from SGLang and compare against baselines.

PROPOSED METHOD

Formalize kernel optimization as correctness-constrained speedup maximization over a parameterized template space.
Introduce a four-agent, closed-loop pipeline (planning, generation, tuning, testing) guiding semantic refactoring and template-based tuning.
Refactor kernels into parameterizable templates at the semantic level to expose tunable execution strategies.
Derive a feasible parameter space under hardware resource constraints and perform on-device, feed-forward search to minimize runtime.
Validate correctness against a baseline, measure performance through repeated runs, and update plans based on measurement signals.
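
The search and validation steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the resource limits, the shared-memory footprint model, the parameter names `block` and `tile`, and the `fake_run_kernel` stand-in (which replaces compiling and launching a real templated kernel with a synthetic cost model) are all assumptions made for the example.

```python
from itertools import product
from statistics import median

# Assumed hardware limits (illustrative; real values would be queried from the device).
MAX_THREADS_PER_BLOCK = 1024
MAX_SHARED_MEM_BYTES = 48 * 1024


def feasible_configs(block_sizes, tile_sizes, bytes_per_elem=4):
    """Yield (block, tile) settings that satisfy the assumed resource limits."""
    for block, tile in product(block_sizes, tile_sizes):
        shared_bytes = block * tile * bytes_per_elem  # hypothetical footprint model
        if block <= MAX_THREADS_PER_BLOCK and shared_bytes <= MAX_SHARED_MEM_BYTES:
            yield {"block": block, "tile": tile}


def tune(configs, run_kernel, reference, tol=1e-5, repeats=5):
    """Return the fastest config whose output matches the reference within tol."""
    best_cfg, best_time = None, float("inf")
    for cfg in configs:
        times, correct = [], True
        for _ in range(repeats):
            output, elapsed = run_kernel(cfg)
            if any(abs(o - r) > tol for o, r in zip(output, reference)):
                correct = False  # correctness is a hard constraint, not a score
                break
            times.append(elapsed)
        if not correct:
            continue
        t = median(times)  # repeated runs reduce measurement noise
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time


def fake_run_kernel(cfg):
    """Hypothetical stand-in for launching a templated kernel."""
    output = [1.0, 2.0, 3.0]
    if cfg["tile"] == 16:  # pretend this setting is fast but numerically wrong
        output = [1.0, 2.0, 3.5]
    elapsed = 1000.0 / (cfg["block"] * cfg["tile"]) + 0.01 * cfg["block"]
    return output, elapsed


configs = list(feasible_configs([128, 256, 512, 1024, 2048], [1, 2, 4, 8, 16]))
best_cfg, best_time = tune(configs, fake_run_kernel, reference=[1.0, 2.0, 3.0])
print(best_cfg, best_time)
```

In this toy setup the `tile == 16` settings would be the fastest, but they fail the tolerance check and are discarded, so the search returns the fastest configuration that is both feasible and correct.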

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How can semantics-preserving refactoring be integrated with resource-aware parameter search to reach high-performance GPU kernels?
RQ2: Does template-based parameterization plus search improve generalization and reproducibility over pure multi-agent rewriting?
RQ3: How does the two-level tuner perform across shapes and kernels under hardware constraints?

KEY FINDINGS

Kernel	SGLang baseline (μs)	Astra (speedup)	Our method (speedup)
Kernel-1	199.15	2.89×	3.55×
Kernel-2	163.76	1.06×	1.09×
Kernel-3	45.83	1.95×	2.03×
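
As a quick sanity check, the gain of the proposed method relative to Astra can be recomputed from the speedup columns of the table. The short sketch below assumes that both speedups are measured against the same SGLang baseline, so their ratio gives the relative improvement; the dictionary layout and function name are illustrative only.

```python
# Speedups over the SGLang baseline, copied from the table above.
speedups = {
    "Kernel-1": {"astra": 2.89, "ours": 3.55},
    "Kernel-2": {"astra": 1.06, "ours": 1.09},
    "Kernel-3": {"astra": 1.95, "ours": 2.03},
}


def relative_gain_pct(ours, astra):
    """Percentage improvement of one speedup over another (same baseline assumed)."""
    return (ours / astra - 1.0) * 100.0


gains = {
    name: round(relative_gain_pct(s["ours"], s["astra"]), 1)
    for name, s in speedups.items()
}
print(gains)  # {'Kernel-1': 22.8, 'Kernel-2': 2.8, 'Kernel-3': 4.1}
```

Kernel-1 accounts for most of the difference between the two methods; on Kernel-2 and Kernel-3 the gap is a few percent.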

The method achieves 1.09×–3.55× speedup over the SGLang baseline across three kernels; Astra alone achieves 1.06×–2.89×.
Our approach outperforms Astra on all three kernels, with Kernel-1 showing the largest relative gain (about 22.8%).
Templatization exposes key execution degrees of freedom and enables a higher performance ceiling when combined with search-based autotuning.
Across shapes, the general configuration improves robustness and cross-shape performance compared to Astra, though gains vary with problem size and structure.
Correctness is preserved for all optimized kernels, with outputs matching the baseline within the specified tolerance.

This digest was generated by AI and reviewed by a human editor.