QUICK REVIEW

[Paper Review] A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

Qiuyi Qu, Yicheng Sui | arXiv (Cornell University) | 2026-01-19

Parallel Computing and Optimization Techniques | Citations: 0

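One-Line Summary

This paper presents a two-level GPU code tuner that first refactors kernels into parameterized templates and then optimizes performance with a guided search under hardware constraints, while preserving correctness. It outperforms Astra and achieves up to a 3.55× speedup over SGLang.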

ABSTRACT

GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving near-hardware-limit performance still relies heavily on manual code refactoring and parameter tuning. Recent progress in LLM-agent-based kernel generation and optimization has been reported, yet many approaches primarily focus on direct code rewriting, where parameter choices are often implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and template parameters are then optimized via search-based autotuning, yielding more stable and higher-quality speedups. Experiments on a set of real-world kernels demonstrate speedups exceeding 3x in the best case. We extract representative CUDA kernels from SGLang as evaluation targets; the proposed agentic tuner iteratively performs templating, testing, analysis, and planning, and leverages profiling feedback to execute constrained parameter search under hardware resource limits. Compared to agent-only direct rewriting, the template-plus-search design significantly reduces the randomness of iterative optimization, making the process more interpretable and enabling a more systematic approach toward high-performance configurations. The proposed method can be further extended to OpenCL, HIP, and other backends to deliver automated performance optimization for real production workloads.

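Motivation and Goals

Motivates the need for GPU kernel optimization that is automated, reproducible, and aware of hardware constraints.
Proposes a two-level optimization workflow that links semantics-preserving refactoring with parameterized search.
Maximizes performance through a bounded, iterative multi-agent loop while preserving correctness as far as possible.
Evaluates the approach on real CUDA kernels from SGLang and compares it against baselines.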

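Proposed Method

Formalizes kernel optimization as correctness-constrained speedup maximization over a parameterized template space.
Introduces a four-agent closed-loop pipeline (planning, generation, tuning, testing) that guides semantic refactoring and template-based tuning.
Refactors kernels at the semantic level into parameterized templates, exposing tunable execution strategies.
Derives the feasible parameter space under hardware resource constraints and performs on-device forward search to minimize runtime.
Validates correctness against the baseline, measures performance over repeated runs, and updates the plan according to the measured signals.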
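
The feasibility filtering, correctness check, and runtime-minimizing search described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the device limits, the shared-memory usage model, the parameter names (block, tile), and the fake_kernel stand-in with its synthetic cost model are all assumptions made for the example, and exhaustive enumeration is used in place of whatever search strategy the authors actually employ.

```python
import itertools
import math

# Assumed device limits for illustration only (not values from the paper).
MAX_THREADS_PER_BLOCK = 1024
MAX_SHARED_MEM_BYTES = 48 * 1024


def feasible_configs(block_sizes, tile_sizes, bytes_per_elem=4):
    """Return (block, tile) pairs that satisfy the assumed resource limits."""
    feasible = []
    for block, tile in itertools.product(block_sizes, tile_sizes):
        # Assumed usage model: one tile of elements per thread in shared memory.
        shared_bytes = block * tile * bytes_per_elem
        if block <= MAX_THREADS_PER_BLOCK and shared_bytes <= MAX_SHARED_MEM_BYTES:
            feasible.append((block, tile))
    return feasible


def matches(output, reference, rel_tol=1e-5):
    """Check that an output agrees with the reference within a relative tolerance."""
    return len(output) == len(reference) and all(
        math.isclose(o, r, rel_tol=rel_tol) for o, r in zip(output, reference)
    )


def tune(run_kernel, reference, configs, repeats=3):
    """Return the fastest configuration whose output matches the reference."""
    best_cfg, best_time = None, math.inf
    for cfg in configs:
        runs = [run_kernel(cfg) for _ in range(repeats)]
        if not all(matches(out, reference) for out, _ in runs):
            continue  # reject configurations that break correctness
        elapsed = min(t for _, t in runs)  # best of the repeated runs
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time


def fake_kernel(cfg):
    """Hypothetical stand-in for launching a templated kernel; returns (output, time in us)."""
    block, tile = cfg
    # tile == 16 deliberately returns a wrong value to exercise the correctness filter.
    output = [1.0, 2.0, 3.0] if tile != 16 else [1.0, 2.0, 3.5]
    elapsed_us = 100.0 + 0.1 * abs(block - 256) + 2.0 * abs(tile - 8)  # synthetic cost
    return output, elapsed_us


configs = feasible_configs([64, 128, 256, 512, 2048], [4, 8, 16, 32])
best_cfg, best_time = tune(fake_kernel, [1.0, 2.0, 3.0], configs)
print(best_cfg, best_time)
```

In this toy setup the block size 2048 and the pair (512, 32) are pruned before any timing takes place, and every tile=16 configuration is rejected for failing the reference check, so only correct, resource-feasible candidates compete on runtime. In a real tuner, fake_kernel would be replaced by compiling and launching the instantiated template on the target GPU.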

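Experimental Results

Research Questions

RQ1: How can semantics-preserving refactoring be integrated with resource-aware parameter search to reach high-performance GPU kernels?
RQ2: Do template-based parameterization and search improve generalization and reproducibility over pure multi-agent rewriting?
RQ3: How does the two-level tuner behave across shapes and kernels under resource constraints?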

Key Results

Kernel	SGLang baseline (μs)	Astra (speedup)	Our method (speedup)
Kernel-1	199.15	2.89×	3.55×
Kernel-2	163.76	1.06×	1.09×
Kernel-3	45.83	1.95×	2.03×

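The method achieves speedups of 1.09× to 3.55× over the SGLang baseline across the three kernels, whereas Astra alone achieves 1.06× to 2.89×.
The approach outperforms Astra on all three kernels, with the largest relative gain on Kernel-1 (about 22.8%, i.e. 3.55× versus 2.89×).
Templatization exposes the key execution degrees of freedom and, when combined with search-based autotuning, enables a higher performance ceiling.
A single general configuration across shapes improves robustness and cross-shape performance, although the gains vary with problem size and structure.
Correctness is preserved for all optimized kernels: outputs match the baseline within the specified tolerance.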
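
The relative-gain figure quoted above can be checked directly from the table. The sketch below assumes that speedup is defined as baseline time divided by optimized time, so the implied latencies it prints are derived values, not numbers reported in the source; the function names are illustrative.

```python
# Values copied from the results table: (baseline_us, astra_speedup, ours_speedup).
RESULTS = {
    "Kernel-1": (199.15, 2.89, 3.55),
    "Kernel-2": (163.76, 1.06, 1.09),
    "Kernel-3": (45.83, 1.95, 2.03),
}


def implied_latency_us(baseline_us, speedup):
    """Latency implied by a speedup, assuming speedup = baseline / optimized."""
    return baseline_us / speedup


def relative_gain_pct(ours_speedup, astra_speedup):
    """Percentage by which our speedup exceeds Astra's speedup."""
    return (ours_speedup / astra_speedup - 1.0) * 100.0


for name, (baseline_us, astra, ours) in RESULTS.items():
    latency = implied_latency_us(baseline_us, ours)
    gain = relative_gain_pct(ours, astra)
    print(f"{name}: implied latency {latency:.1f} us, gain over Astra {gain:.1f}%")
```

Under that assumption, the gain over Astra is roughly 22.8% on Kernel-1, 2.8% on Kernel-2, and 4.1% on Kernel-3, which is consistent with Kernel-1 showing the largest relative improvement.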


This review was generated by AI and reviewed by a human editor.