QUICK REVIEW

[Paper Review] Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

Donglin Yu|arXiv (Cornell University)|Mar 13, 2026

Parallel Computing and Optimization Techniques | Cited by 0

One-Sentence Summary

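The paper proves that partitioning multimodal LLM inference at the modality boundary reduces cross-device transfer from GB-scale KV caches to MB-scale embeddings, which enables cross-tier heterogeneous serving, and it demonstrates practical cost and throughput gains with HeteroServe.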

ABSTRACT

Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces transfer complexity from $O(L \cdot s_{\text{ctx}})$ bytes (GB-scale KV caches under stage-level disaggregation) to $O(N_v \cdot d)$ bytes (MB-scale embeddings), an $O(L)$ reduction where $L$ is the transformer depth. The result holds across attention mechanisms (MHA/GQA), dynamic vision resolutions, and model scales, and the advantage grows as models deepen. A direct implication is that existing stage-level disaggregation systems are constrained to high-bandwidth interconnects (e.g., NVLink), whereas modality-level disaggregation enables cross-tier heterogeneous serving over commodity PCIe. A closed-form cost model shows that heterogeneous deployment is cost-optimal under phase-separable workloads (predicts 31.4% savings; observed 40.6%). We build HeteroServe, a phase-aware runtime with modality-level partitioning and cross-tier scheduling, and evaluate it on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0. On identical 4×A100 hardware, engine optimizations raise throughput by up to 54%. Under a fixed budget, a heterogeneous cluster (\$38k) improves Tokens/\$ by 37% over a homogeneous baseline (\$64k) without degrading latency.

Research Motivation and Objectives

Motivate the architectural mismatch between vision encoding and language decoding in MLLMs and quantify cross-device transfer costs.
Theoretically establish that modality-boundary partitioning minimizes cross-device data transfer under standard KV caching (Theorem 1).
Develop a phase-aware runtime (HeteroServe) that exploits modality-level partitioning and cross-tier scheduling.
Provide a closed-form cost model to determine when heterogeneous deployment is cost-optimal and validate with real hardware.

Proposed Method

Characterize MLLM inference phases and their hardware bottlenecks (vision compute-bound, language memory-bound).
Derive transfer-size formulas for stage-level KV caches versus modality-level embeddings, and prove how the transfer ratio $R = D_{\text{KV}}/D_{\text{emb}}$ scales.
Propose a cost model for heterogeneous deployment and derive conditions for cost savings (Eq. 7 and Eq. 8).
Design and implement HeteroServe with embedding-only transfer, cross-type work stealing, and CUDA-Graph-accelerated decoding.
Evaluate on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0, comparing throughput, latency, and cost under PCIe vs NVLink.
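The transfer-size comparison in the steps above can be illustrated with a minimal sketch. It assumes the standard KV-cache size (two tensors per layer, one key and one value) and a dense fp16 embedding tensor; the exact formulas in the paper may differ in detail. The `ModelShape` class, the function names, and the LLaVA-1.5-7B dimensions (32 layers, hidden size 4096, 32 KV heads of dimension 128, 576 visual tokens) are assumptions supplied for illustration and are not taken from this review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical container for the dimensions that drive transfer size."""
    num_layers: int          # L, transformer depth
    hidden_size: int         # d, embedding width
    num_kv_heads: int        # equals the head count for MHA, fewer for GQA
    head_dim: int            # per-head dimension
    num_vision_tokens: int   # N_v, visual tokens produced by the encoder
    bytes_per_elem: int = 2  # fp16 (assumption)


def kv_cache_bytes(m: ModelShape, s_ctx: int) -> int:
    """Bytes moved when the KV cache for s_ctx tokens crosses devices."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * s_ctx * m.bytes_per_elem


def embedding_bytes(m: ModelShape) -> int:
    """Bytes moved when only the visual embeddings cross devices."""
    return m.num_vision_tokens * m.hidden_size * m.bytes_per_elem


def transfer_ratio(m: ModelShape, s_ctx: int) -> float:
    """R = D_KV / D_emb for a given context length."""
    return kv_cache_bytes(m, s_ctx) / embedding_bytes(m)


# Assumed LLaVA-1.5-7B shape; not stated in the review itself.
llava = ModelShape(num_layers=32, hidden_size=4096, num_kv_heads=32,
                   head_dim=128, num_vision_tokens=576)

emb_mib = embedding_bytes(llava) / 2**20
kv_mib = kv_cache_bytes(llava, s_ctx=576) / 2**20
ratio = transfer_ratio(llava, s_ctx=576)
print(f"embeddings: {emb_mib:.1f} MiB, KV cache: {kv_mib:.0f} MiB, R = {ratio:.0f}x")
```

Under these assumptions the embedding transfer comes to 4.5 MiB, in line with the roughly 4.5 MB quoted in the Figure 2 caption. With MHA and a context holding only the visual tokens, the ratio reduces to 2L, which is consistent with the stated O(L) reduction. The sketch is not meant to reproduce the reported 78×–196× range, which depends on context lengths and architectures not given here.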

Figure 1: (a) Cost saving $\Delta_{\text{cost}}$ (Eq. 8) as a function of the vision-to-language time ratio $\rho$ for different price ratios $\gamma$. The RTX 4090/A100 operating point ($\gamma = 0.19$, $\rho = 0.63$) is marked. (b) Transfer ratio $R$ (Eq. 2) across model depths, confirming [...]

Experimental Results

Research Questions

RQ1: Does partitioning at the modality boundary minimize cross-device transfer under standard KV caching across MLLMs?
RQ2: Can cross-tier heterogeneous deployment offer practical cost and throughput benefits for multimodal inference on commodity interconnects?
RQ3: What are the theoretical and empirical gains in transfer efficiency and economic cost when using modality-level disaggregation?
RQ4: Is a phase-aware runtime like HeteroServe feasible and beneficial on real hardware with dynamic vision tokens and varied attention mechanisms?

Key Findings

Modality-level disaggregation reduces cross-device transfer from GB-scale KV caches to MB-scale embeddings, an O(L) reduction, with transfer ratios R of 78×–196× across the reported architectures (MHA/GQA).
A closed-form cost model predicts 31.4% savings for heterogeneous deployment, with observed savings of 40.6%.
HeteroServe's engine optimizations yield up to 54% higher throughput than the baseline on identical 4×A100 hardware.
Under a fixed budget, a heterogeneous cluster ($38k) delivers 37% more Tokens/$ than a homogeneous baseline ($64k) without latency degradation.
Empirical evaluation on LLaVA-1.5-7B and Qwen2.5-VL demonstrates practical viability of modality-level cross-tier serving with PCIe.
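The budget figures in the findings above can be cross-checked with simple arithmetic. The sketch below assumes that Tokens/$ means throughput divided by cluster hardware cost, which this review does not state explicitly; the function names are hypothetical and the implied throughput ratio is an inference from that assumption, not a number reported by the paper.

```python
def cost_saving(hetero_cost: float, homo_cost: float) -> float:
    """Fractional hardware-cost saving of the heterogeneous cluster."""
    return 1.0 - hetero_cost / homo_cost


def implied_throughput_ratio(tokens_per_dollar_gain: float,
                             hetero_cost: float,
                             homo_cost: float) -> float:
    """Throughput of the heterogeneous cluster relative to the baseline.

    Assumes Tokens/$ = throughput / cost, so
    T_het / T_hom = (1 + gain) * C_het / C_hom.
    """
    return (1.0 + tokens_per_dollar_gain) * hetero_cost / homo_cost


HETERO_COST = 38_000  # USD, from the findings
HOMO_COST = 64_000    # USD, from the findings

saving = cost_saving(HETERO_COST, HOMO_COST)
ratio = implied_throughput_ratio(0.37, HETERO_COST, HOMO_COST)
print(f"cost saving: {saving:.1%}, implied throughput ratio: {ratio:.2f}")
```

The price gap alone gives 1 − 38/64 ≈ 40.6%, which coincides with the observed savings figure. Under the stated assumption, a 37% Tokens/$ gain would correspond to the heterogeneous cluster delivering roughly 81% of the baseline throughput at about 59% of the cost.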

Figure 2: HeteroServe architecture. Consumer GPUs (RTX 4090) handle vision encoding and transfer lightweight visual embeddings (${\sim}4.5$ MB) via PCIe to datacenter GPUs (A100), which perform language generation. When the consumer pool is idle, cross-type work stealing allows consumer GPUs to as [...]
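To give a feel for why an embedding-sized payload is practical over PCIe, the following sketch estimates ideal transfer times from payload size and link bandwidth. The bandwidth value is an illustrative assumption rather than a measurement from the paper, the 1 GiB KV-cache payload is a hypothetical stand-in for "GB-scale", and the estimate ignores latency, protocol overhead, and contention.

```python
def transfer_time_ms(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds (bandwidth in GB/s, 1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


EMBEDDING_BYTES = 4.5 * 2**20   # ~4.5 MB from the caption, treated as MiB here
KV_CACHE_BYTES = 1.0 * 2**30    # hypothetical 1 GiB stand-in for a GB-scale KV cache
ASSUMED_PCIE_GB_PER_S = 25.0    # assumed effective PCIe bandwidth, not measured

emb_ms = transfer_time_ms(EMBEDDING_BYTES, ASSUMED_PCIE_GB_PER_S)
kv_ms = transfer_time_ms(KV_CACHE_BYTES, ASSUMED_PCIE_GB_PER_S)
print(f"embeddings: {emb_ms:.2f} ms, KV cache: {kv_ms:.1f} ms over the assumed link")
```

With these assumed numbers the embedding transfer takes a fraction of a millisecond while the KV-cache-sized payload takes tens of milliseconds per request, which is the gap that motivates embedding-only transfer on commodity interconnects.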

This review was generated by AI and reviewed by human editors.