QUICK REVIEW

[Paper Review] TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition

Prabhu Vellaisamy, Shreesh Tripathi|arXiv (Cornell University)|Mar 12, 2026

Software System Performance and Reliability | Citations: 0

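One-Sentence Summary

TaxBreak presents a trace-driven method that decomposes host-visible LLM inference overhead into framework translation, CUDA library translation, and kernel launch cost, together with the Host-Device Balance Index (HDBI), which diagnoses the balance between host-side orchestration and device-side execution.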

ABSTRACT

Large Language Model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches typically expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. We validate TaxBreak on NVIDIA H100 and H200 systems and use it to derive our proposed Host-Device Balance Index (HDBI), a boundedness summary index that relates device-active execution to host-visible orchestration. Across representative dense and mixture-of-experts workloads in both prefill and decode, we show that aggregate latency, GPU inactivity, or boundedness ratios alone can obscure the dominant optimization target. TaxBreak instead distinguishes cases where optimization should reduce software-stack overhead from cases where the primary win comes from reducing device-side work. We further show that MoE models dispatch 8-11x more kernels per output token than dense models, and that for such host-bound workloads, CPU single-thread performance is a first-order parameter: a faster host CPU reduces orchestration overhead by 10-29% and improves end-to-end latency by up to 14%, even when paired with a slower-clocked GPU. These results position TaxBreak as a diagnostic tool for assessing whether optimization effort should target the software stack or the device-side workload execution.

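Motivation and Objectives

Motivates the need to attribute LLM inference inefficiency across host-side abstraction layers and execution phases.
Decomposes host overhead, layer by layer, into three components: framework translation, CUDA library translation, and kernel launch cost.
Introduces the Host-Device Balance Index (HDBI), which quantifies whether a workload is CPU-bound or GPU-bound and guides where optimization should focus.
Validates the approach on NVIDIA H100 and H200 platforms for dense and mixture-of-experts (MoE) workloads in both prefill and decode.
Shows that aggregate metrics can obscure the dominant optimization target and that CPU performance can significantly affect end-to-end latency.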

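Proposed Method

Decomposes host-side latency for each kernel into three terms: DeltaFT (framework translation), DeltaCT (CUDA library translation, which applies to library-mediated kernels), and DeltaKT (the hardware-floor kernel launch cost).
Measures with a two-phase pipeline: Phase 1 traces the full model to build a kernel database; Phase 2 performs isolated replay against a null-kernel floor to isolate dispatch and launch overhead.
Classifies each kernel as library-mediated (I_lib = 1) or framework-native (I_lib = 0) to attribute its cost to DeltaCT or DeltaFT.
Computes the Host-Device Balance Index as HDBI = T_DeviceActive / (T_DeviceActive + T_Orchestration), which indicates whether a workload sits in the host-bound or device-bound regime.
Provides a procedure for matching replayed kernels to traced kernels (exact match, substring match, most-frequent match) along with a classification of kernel families.
Compares prefill and decode for dense and MoE workloads on two NVIDIA platforms (H100 and H200).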
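
As a rough illustration of how the three per-kernel terms and the HDBI formula fit together, the following is a minimal Python sketch. It is not the authors' tooling. The record layout, the function names, and the additive model (DeltaFT + I_lib * DeltaCT + DeltaKT, summed over kernels to give T_Orchestration) are assumptions made for this example, and the numbers are invented rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class KernelRecord:
    """Per-kernel host-side costs in microseconds (hypothetical record layout)."""
    name: str
    delta_ft: float     # DeltaFT: framework translation time
    delta_ct: float     # DeltaCT: CUDA library translation time
    delta_kt: float     # DeltaKT: launch-path floor (null-kernel baseline)
    is_library: bool    # I_lib: True for library-mediated kernels


def host_cost(k: KernelRecord) -> float:
    """Assumed additive model: DeltaFT + I_lib * DeltaCT + DeltaKT."""
    return k.delta_ft + (k.delta_ct if k.is_library else 0.0) + k.delta_kt


def orchestration_time(kernels: Iterable[KernelRecord]) -> float:
    """Sum the per-kernel host costs into T_Orchestration (assumption)."""
    return sum(host_cost(k) for k in kernels)


def hdbi(t_device_active: float, t_orchestration: float) -> float:
    """HDBI = T_DeviceActive / (T_DeviceActive + T_Orchestration)."""
    total = t_device_active + t_orchestration
    if total <= 0:
        raise ValueError("total time must be positive")
    return t_device_active / total


# Invented numbers for illustration only; not measurements from the paper.
trace = [
    KernelRecord("gemm", delta_ft=4.0, delta_ct=2.0, delta_kt=4.0, is_library=True),
    KernelRecord("elementwise_add", delta_ft=6.0, delta_ct=0.0, delta_kt=4.0, is_library=False),
]
t_orch = orchestration_time(trace)                           # 10.0 + 10.0 = 20.0
print(hdbi(t_device_active=60.0, t_orchestration=t_orch))    # 60 / 80 = 0.75
```

Under this reading, an HDBI close to 1 means device-active time dominates, while a value close to 0 means host-visible orchestration dominates.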

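Experimental Results

Research Questions

RQ1 How can host-side LLM inference overhead be decomposed across framework translation, the CUDA library front end, and the kernel launch path?
RQ2 Does the Host-Device Balance Index reliably indicate whether optimization should target the software stack or device-side execution?
RQ3 How do dense and MoE LLMs differ in kernel fragmentation and in host-bound versus device-bound behavior during prefill and decode?
RQ4 How does CPU single-thread performance affect host orchestration and end-to-end latency?
RQ5 Can per-phase measurement identify optimization targets such as kernel fusion, CUDA Graphs, and runtime compilation better than coarse GPU utilization metrics?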

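Key Findings

TaxBreak decomposes host overhead into three layers: framework translation, CUDA library translation, and the kernel launch floor cost.
MoE models dispatch 8-11x more kernels per output token than dense models, which increases host overhead even when active parameter counts are comparable.
A faster single-thread CPU reduces host orchestration overhead by 10-29% and improves end-to-end latency by up to 14%, even when paired with a slower-clocked GPU.
HDBI provides a boundedness summary that indicates whether optimization should reduce software-stack overhead or shrink device-side work, clarifying the host-bound versus device-bound regime.
For GPT-2 on H200, host orchestration stays nearly constant across batch sizes while device work drives the latency increase, a case where HDBI is more informative than aggregate launch metrics.
Across dense and MoE workloads in both prefill and decode, aggregate metrics alone can miss the dominant optimization target, which underscores the importance of cross-stack attribution.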
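
To make the batch-size observation concrete, here is a small Python sketch of how the index behaves when orchestration time is held fixed and device-active time grows. All timings are hypothetical placeholders, not GPT-2 or H200 measurements. The 0.5 cut-off in `classify` is an assumption for this example (it is the point where the two terms are equal), not a threshold stated in the review.

```python
def hdbi(t_device_active: float, t_orchestration: float) -> float:
    """HDBI = T_DeviceActive / (T_DeviceActive + T_Orchestration)."""
    return t_device_active / (t_device_active + t_orchestration)


def classify(index: float, threshold: float = 0.5) -> str:
    """Label a regime; the 0.5 threshold is an illustrative assumption."""
    return "device-bound" if index >= threshold else "host-bound"


# Hypothetical timings in milliseconds, chosen only to show the trend.
T_ORCH_MS = 2.0                                    # orchestration held constant
device_ms_by_batch = {1: 0.5, 8: 2.0, 64: 14.0}    # device work grows with batch

sweep = {batch: hdbi(t_dev, T_ORCH_MS) for batch, t_dev in device_ms_by_batch.items()}
for batch, index in sweep.items():
    print(f"batch={batch:>3}  HDBI={index:.2f}  {classify(index)}")
```

With these placeholder values the index rises from 0.2 to 0.875 as batch size grows, even though the orchestration cost never changes. An aggregate launch metric alone would not show that shift.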

This review was generated by AI and checked by a human editor.