QUICK REVIEW

[Paper Review] KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Qitong Sun, Jun Han|arXiv (Cornell University)|Mar 10, 2026

Advanced Neural Network Applications | Citations: 0

One-Line Summary

KernelSkill is a memory-augmented multi-agent system that externalizes expert GPU kernel optimization knowledge and per-task refinement history, enabling accurate and fast kernel optimization guided by profiling feedback.

ABSTRACT

Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines typically rely on opaque, implicitly learned heuristics within the LLMs to determine optimization strategies. This leads to inefficient trial-and-error and weakly interpretable optimizations. Our key insight is to replace implicit heuristics with expert optimization skills that are knowledge-driven and aware of task trajectories. Specifically, we present KernelSkill, a multi-agent framework with a dual-level memory architecture. KernelSkill operates by coordinating agents with long-term memory of reusable expert skills and short-term memory to prevent repetitive backtracking. On KernelBench Levels 1-3, KernelSkill achieves a 100% success rate and average speedups of 5.44x, 2.82x, and 1.92x over Torch Eager on Levels 1, 2, and 3, respectively, outperforming prior baselines. Code is available at https://github.com/0satan0/KernelMem/.

Motivation and Objectives

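Motivate the need for expert-guided, interpretable GPU kernel optimization that responds to bottlenecks identified by profiling.
Introduce KernelSkill as a multi-agent framework with a two-level memory (long-term and short-term) that makes kernel refinement stable and traceable.
Demonstrate that KernelSkill achieves a higher success rate and larger speedups than the baselines on KernelBench Levels 1-3.
Show how long-term memory distills expert optimization knowledge and how short-term memory stabilizes per-task refinement.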

Proposed Method

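Propose a closed-loop multi-agent refinement pipeline that includes a Generator, Reviewer, Feature Extractor, Planner, Optimizer, and Repairer.
Implement a two-level memory system: long-term memory (holding decision policies and method knowledge) and short-term memory (holding per-task optimization and repair trajectories).
Retrieve candidate optimization methods from long-term memory using static code features and profiling signals, then plan concrete, step-by-step edits.
Separate the repair path from the optimization path, using a Diagnoser and Repairer to handle compilation and correctness failures before refinement continues.
Keep past plans, edits, and outcomes per task in short-term memory to stabilize multi-round refinement and prevent oscillation.
Drive iterative improvement for at most N rounds using seed kernel generation, correctness checks, and profiling (nsys / Nsight Compute).
Apply every optimization action under the environment's constraints through the Optimizer and Repairer, which are responsible for keeping kernels compilable and correct.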
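
The loop described above can be pictured with a small sketch. This is not the authors' implementation: the `Skill`, `ShortTermMemory`, `retrieve`, and `refine` names, the tag-matching rule, and the `evaluate` callback (standing in for compile, correctness check, and profiling) are all hypothetical, chosen only to illustrate how a long-term skill store and a per-task short-term trajectory could interact.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """One long-term memory entry (hypothetical schema)."""
    name: str
    applies_when: set  # feature/profiling tags under which the skill is a candidate


@dataclass
class ShortTermMemory:
    """Per-task trajectory: which skills were tried and what they yielded."""
    tried: list = field(default_factory=list)  # (skill name, speedup) per round

    def already_tried(self, name):
        return any(n == name for n, _ in self.tried)


def retrieve(long_term, signals, stm):
    """Return skills whose conditions all hold and that were not tried yet."""
    return [s for s in long_term
            if s.applies_when <= signals and not stm.already_tried(s.name)]


def refine(long_term, signals, evaluate, max_rounds=3):
    """Run at most max_rounds rounds; evaluate(name) -> (is_correct, speedup)."""
    stm = ShortTermMemory()
    best_name, best_speedup = "seed", 1.0
    for _ in range(max_rounds):
        candidates = retrieve(long_term, signals, stm)
        if not candidates:
            break  # nothing new to try; stop instead of revisiting old edits
        skill = candidates[0]
        ok, speedup = evaluate(skill.name)
        # Record the outcome so later rounds never repeat this attempt.
        stm.tried.append((skill.name, speedup if ok else 0.0))
        if ok and speedup > best_speedup:
            best_name, best_speedup = skill.name, speedup
    return best_name, best_speedup, stm
```

In this sketch, a failed or slower attempt is still written to short-term memory, which is what keeps the loop from oscillating between the same two edits. A real system would replace `evaluate` with actual compilation, correctness checks, and profiler runs.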

Experimental Results

Research Questions

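RQ1: Does the memory-augmented multi-agent framework outperform baseline LLM-based optimization on KernelBench Levels 1-3 in success rate and speedup?
RQ2: Does separating long-term optimization knowledge from per-task trajectory memory improve interpretability, stability, and reusability?
RQ3: How do static code features and profiling signals guide robust, verifiable selection of optimization methods?
RQ4: How much does the two-level memory (long-term and short-term) affect refinement efficiency and convergence under profiling feedback?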

Key Findings

Method	Level 1 Success	Level 1 Speedup	Level 2 Success	Level 2 Speedup	Level 3 Success	Level 3 Speedup
Kevin-32B	0.83	1.18	0.92	1.74	0.46	0.32
Astra	0.95	1.48	0.98	0.99	0.93	0.90
PRAGMA	0.95	1.49	0.98	1.02	0.94	0.92
CudaForge	0.96	1.45	1.00	2.10	0.96	1.28
QiMeng	1.00	2.20	0.99	1.22	0.70	0.73
STARK	1.00	3.03	1.00	2.69	1.00	1.58
KernelSkill	1.00	5.44	1.00	2.82	1.00	1.92

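KernelSkill achieves a 100% success rate across KernelBench Levels 1-3.
Average speedups over Torch Eager are 5.44× on Level 1, 2.82× on Level 2, and 1.92× on Level 3.
KernelSkill outperforms strong profiling-based baselines (Astra, PRAGMA, CudaForge) and training-based methods (QiMeng, Kevin-32B) at every level.
Long-term memory enables verifiable method decisions and reusable optimization knowledge, while short-term memory stabilizes per-task refinement and reduces oscillation.
The ablation study shows that removing either memory degrades performance, underscoring the value of both memory levels.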
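
To put the table in perspective, the short script below computes, for each level, the ratio of KernelSkill's average speedup to that of the strongest other method. The numbers are copied from the table above; the function name and dictionary layout are illustrative. Note that this is a ratio of reported averages, not a per-task comparison, so it should be read as a rough margin only.

```python
# Average speedups over Torch Eager for Levels 1, 2, 3 (from the table above).
speedups = {
    "Kevin-32B":   (1.18, 1.74, 0.32),
    "Astra":       (1.48, 0.99, 0.90),
    "PRAGMA":      (1.49, 1.02, 0.92),
    "CudaForge":   (1.45, 2.10, 1.28),
    "QiMeng":      (1.00 * 2.20, 1.22, 0.73),
    "STARK":       (3.03, 2.69, 1.58),
    "KernelSkill": (5.44, 2.82, 1.92),
}


def margin_over_best_baseline(table, method="KernelSkill"):
    """For each level, return (best other method, ratio of averages)."""
    ours = table[method]
    result = []
    for level in range(len(ours)):
        name, best = max(
            ((m, v[level]) for m, v in table.items() if m != method),
            key=lambda pair: pair[1],
        )
        result.append((name, round(ours[level] / best, 2)))
    return result


for level, (name, ratio) in enumerate(margin_over_best_baseline(speedups), start=1):
    print(f"Level {level}: {ratio}x the average speedup of {name}")
```

By this measure, STARK is the closest competitor at every level, and the gap is widest on Level 1 and narrowest on Level 2.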

Figure 2 : The short-term memory for the current repair round.


This review was generated by AI and checked by a human editor.