QUICK REVIEW

[Paper Review] Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

Donglin Yu|arXiv (Cornell University)|Mar 13, 2026
Parallel Computing and Optimization Techniques | Cited by 0
One-Sentence Summary

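The paper proves that partitioning multimodal LLM inference at the modality boundary reduces cross-device transfer from GB-scale KV caches to MB-scale embeddings, which enables cross-tier heterogeneous serving, and it uses HeteroServe to demonstrate practical cost and throughput gains.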

ABSTRACT

Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces transfer complexity from $O(L \cdot s_{\text{ctx}})$ bytes (GB-scale KV caches under stage-level disaggregation) to $O(N_v \cdot d)$ bytes (MB-scale embeddings), an $O(L)$ reduction where $L$ is the transformer depth. The result holds across attention mechanisms (MHA/GQA), dynamic vision resolutions, and model scales, and the advantage grows as models deepen. A direct implication is that existing stage-level disaggregation systems are constrained to high-bandwidth interconnects (e.g., NVLink), whereas modality-level disaggregation enables cross-tier heterogeneous serving over commodity PCIe. A closed-form cost model shows that heterogeneous deployment is cost-optimal under phase-separable workloads (predicts 31.4% savings; observed 40.6%). We build HeteroServe, a phase-aware runtime with modality-level partitioning and cross-tier scheduling, and evaluate it on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0. On identical 4×A100 hardware, engine optimizations raise throughput by up to 54%. Under a fixed budget, a heterogeneous cluster (\$38k) improves Tokens/\$ by 37% over a homogeneous baseline (\$64k) without degrading latency.

Motivation and Objectives

  • Motivate the architectural mismatch between vision encoding and language decoding in MLLMs and quantify cross-device transfer costs.
  • Theoretically establish that modality-boundary partitioning minimizes cross-device data transfer under standard KV caching (Theorem 1).
  • Develop a phase-aware runtime (HeteroServe) that exploits modality-level partitioning and cross-tier scheduling.
  • Provide a closed-form cost model to determine when heterogeneous deployment is cost-optimal and validate with real hardware.

Proposed Method

  • Characterize MLLM inference phases and their hardware bottlenecks (vision encoding is compute-bound, language generation is memory-bandwidth-bound).
  • Derive transfer-size formulas for stage-level KV caches versus modality-level embeddings, and prove how the transfer ratio $R = D_{\text{KV}}/D_{\text{emb}}$ scales.
  • Propose a cost model for heterogeneous deployment and derive conditions for cost savings (Eq. 7 and Eq. 8).
  • Design and implement HeteroServe with embedding-only transfer, cross-type work stealing, and CUDA-Graph-accelerated decoding.
  • Evaluate on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0, comparing throughput, latency, and cost under PCIe vs NVLink.
Figure 1: (a) Cost saving $\Delta_{\text{cost}}$ (Eq. 8) as a function of the vision-to-language time ratio $\rho$ for different price ratios $\gamma$. The RTX 4090/A100 operating point ($\gamma{=}0.19$, $\rho{=}0.63$) is marked. (b) Transfer ratio $R$ (Eq. 2) across model depths, confirming
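
To make the transfer ratio $R = D_{\text{KV}}/D_{\text{emb}}$ concrete, here is a minimal sketch that compares the two transfer sizes. It uses the textbook KV cache size (one key and one value vector per layer, KV head, and token), which may differ in detail from the paper's own equations. The model dimensions (32 layers, 32 KV heads of size 128, 576 visual tokens, fp16) and the 64-token text prompt are assumptions chosen to resemble a 7B MHA model, not values taken from this review.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_elem=2):
    """Bytes in a standard transformer KV cache (keys + values for every layer and token)."""
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_elem


def embedding_bytes(num_visual_tokens, hidden_dim, bytes_per_elem=2):
    """Bytes in the visual embeddings handed from the vision encoder to the language model."""
    return num_visual_tokens * hidden_dim * bytes_per_elem


# Assumed configuration (hypothetical, not taken from the review).
NUM_LAYERS = 32
NUM_KV_HEADS = 32                     # MHA: as many KV heads as query heads
HEAD_DIM = 128
HIDDEN_DIM = NUM_KV_HEADS * HEAD_DIM  # 4096
NUM_VISUAL_TOKENS = 576
TEXT_TOKENS = 64                      # hypothetical prompt length

context_tokens = NUM_VISUAL_TOKENS + TEXT_TOKENS
d_kv = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, context_tokens)
d_emb = embedding_bytes(NUM_VISUAL_TOKENS, HIDDEN_DIM)
ratio = d_kv / d_emb

print(f"KV cache:   {d_kv / 2**20:.1f} MiB")
print(f"Embeddings: {d_emb / 2**20:.1f} MiB")
print(f"Ratio R:    {ratio:.1f}x")
```

Under these assumptions the embeddings come to 4.5 MiB, in line with the roughly 4.5 MB quoted in the Figure 2 caption. For MHA the ratio simplifies to $2L \cdot s_{\text{ctx}}/N_v$, so it grows linearly with depth $L$ and with any text tokens added to the context; GQA lowers it by shrinking the number of KV heads.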

Experimental Results

Research Questions

  • RQ1: Does partitioning at the modality boundary minimize cross-device transfer under standard KV caching across MLLMs?
  • RQ2: Can cross-tier heterogeneous deployment offer practical cost and throughput benefits for multimodal inference on commodity interconnects?
  • RQ3: What are the theoretical and empirical gains in transfer efficiency and economic cost when using modality-level disaggregation?
  • RQ4: Is a phase-aware runtime like HeteroServe feasible and beneficial on real hardware with dynamic vision tokens and varied attention mechanisms?

Key Findings

  • Modality-level disaggregation reduces cross-device transfer from GB-scale KV caches to MB-scale embeddings, an $O(L)$ reduction, with transfer ratios $R$ of 78×–196× in the reported architectures (MHA/GQA).
  • A closed-form cost model predicts 31.4% savings for heterogeneous deployment, with observed savings of 40.6%.
  • Engine optimizations in HeteroServe yield up to 54% higher throughput than the baseline on identical 4×A100 hardware.
  • Under a fixed budget, a heterogeneous cluster (\$38k) delivers 37% more Tokens/\$ than a homogeneous baseline (\$64k) without latency degradation.
  • Empirical evaluation on LLaVA-1.5-7B and Qwen2.5-VL demonstrates practical viability of modality-level cross-tier serving with PCIe.
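
The budget figures above can be unpacked with a little arithmetic. The sketch below takes only the three numbers stated in the findings (\$38k, \$64k, and the 37% Tokens/\$ gain) and derives what they imply about relative hardware cost and throughput. It assumes Tokens/\$ means throughput divided by cluster price; the variable names are illustrative.

```python
HETERO_COST_USD = 38_000        # heterogeneous cluster price (from the findings)
HOMO_COST_USD = 64_000          # homogeneous baseline price (from the findings)
TOKENS_PER_DOLLAR_GAIN = 0.37   # reported Tokens/$ improvement

# Fraction of the baseline price paid for the heterogeneous cluster.
cost_ratio = HETERO_COST_USD / HOMO_COST_USD
hardware_saving = 1.0 - cost_ratio

# Tokens/$ = throughput / cost, so the throughput ratio follows from the two
# reported quantities: (T_het / C_het) / (T_hom / C_hom) = 1 + gain.
implied_throughput_ratio = (1.0 + TOKENS_PER_DOLLAR_GAIN) * cost_ratio

print(f"Hardware cost saving:      {hardware_saving:.1%}")
print(f"Implied throughput ratio:  {implied_throughput_ratio:.3f}")
```

Read this way, the heterogeneous cluster would deliver roughly four fifths of the baseline's throughput for about three fifths of the price. The hardware saving of 1 − 38/64 ≈ 40.6% coincides with the "observed 40.6%" figure, although the review does not state how that figure was computed.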
Figure 2: HeteroServe architecture. Consumer GPUs (RTX 4090) handle vision encoding and transfer lightweight visual embeddings (${\sim}4.5$ MB) via PCIe to datacenter GPUs (A100), which perform language generation. When the consumer pool is idle, cross-type work stealing allows consumer GPUs to as
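
A back-of-the-envelope estimate shows why a payload of this size fits comfortably on PCIe. The sketch below compares the time to move the roughly 4.5 MB embedding against a 1 GiB payload standing in for a "GB-scale" KV cache. The 25 GB/s effective bandwidth is an assumed figure for a PCIe 4.0 x16 link (below its nominal peak of about 32 GB/s), not a measurement from the paper, and the model ignores latency and protocol overhead.

```python
def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Idealized transfer time in milliseconds: payload size / sustained bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


PCIE_EFFECTIVE_GB_PER_S = 25.0   # assumed effective PCIe 4.0 x16 bandwidth (hypothetical)
EMBEDDING_BYTES = 4.5 * 2**20    # ~4.5 MB visual embeddings (Figure 2 caption)
KV_CACHE_BYTES = 1 * 2**30       # hypothetical 1 GiB stand-in for a GB-scale KV cache

emb_ms = transfer_ms(EMBEDDING_BYTES, PCIE_EFFECTIVE_GB_PER_S)
kv_ms = transfer_ms(KV_CACHE_BYTES, PCIE_EFFECTIVE_GB_PER_S)

print(f"Embedding transfer: {emb_ms:.3f} ms")
print(f"KV cache transfer:  {kv_ms:.3f} ms")
print(f"Slowdown factor:    {kv_ms / emb_ms:.1f}x")
```

Under these assumptions the embedding moves in well under a millisecond, while the GB-scale payload takes tens of milliseconds per request, which is the gap that pushes stage-level disaggregation toward NVLink-class interconnects.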


This review was generated by AI and checked by a human editor.