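[Paper Digest] Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity
The paper proves that partitioning multimodal LLM inference at the modality boundary reduces cross-device transfer from GB-scale KV caches to MB-scale embeddings, which enables cross-tier heterogeneous serving, and it demonstrates practical cost and throughput gains with HeteroServe.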
Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces transfer complexity from $O(L \cdot s_{\text{ctx}})$ bytes (GB-scale KV caches under stage-level disaggregation) to $O(N_v \cdot d)$ bytes (MB-scale embeddings), an $O(L)$ reduction where $L$ is the transformer depth. The result holds across attention mechanisms (MHA/GQA), dynamic vision resolutions, and model scales, and the advantage grows as models deepen. A direct implication is that existing stage-level disaggregation systems are constrained to high-bandwidth interconnects (e.g., NVLink), whereas modality-level disaggregation enables cross-tier heterogeneous serving over commodity PCIe. A closed-form cost model shows that heterogeneous deployment is cost-optimal under phase-separable workloads (predicts 31.4% savings; observed 40.6%). We build HeteroServe, a phase-aware runtime with modality-level partitioning and cross-tier scheduling, and evaluate it on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0. On identical 4×A100 hardware, engine optimizations raise throughput by up to 54%. Under a fixed budget, a heterogeneous cluster (\$38k) improves Tokens/\$ by 37% over a homogeneous baseline (\$64k) without degrading latency.
Research Motivation and Objectives
- Motivate the architectural mismatch between vision encoding and language decoding in MLLMs and quantify cross-device transfer costs.
- Theoretically establish that modality-boundary partitioning minimizes cross-device data transfer under standard KV caching (Theorem 1).
- Develop a phase-aware runtime (HeteroServe) that exploits modality-level partitioning and cross-tier scheduling.
- Provide a closed-form cost model to determine when heterogeneous deployment is cost-optimal and validate with real hardware.
Proposed Method
- Characterize MLLM inference phases and their hardware bottlenecks (vision compute-bound, language memory-bound).
- Derive transfer-size formulas for stage-level KV caches versus modality-level embeddings, and prove how the transfer ratio $R = D_{KV}/D_{emb}$ scales.
- Propose a cost model for heterogeneous deployment and derive conditions for cost savings (Eq. 7 and Eq. 8).
- Design and implement HeteroServe with embedding-only transfer, cross-type work stealing, and CUDA-Graph-accelerated decoding.
- Evaluate on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0, comparing throughput, latency, and cost under PCIe vs NVLink.
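
The transfer-size comparison behind the ratio $R = D_{KV}/D_{emb}$ can be sketched numerically. The block below is a minimal illustration, not the paper's exact formulation: it assumes the usual KV-cache size of one key and one value tensor per layer, FP16 storage, and an embedding of $N_v \times d$ elements. The class and function names are hypothetical, and the shape values (32 layers, hidden size 4096, 32 KV heads of dimension 128, 576 vision tokens, a 1024-token context) are illustrative assumptions chosen to resemble a 7B-class MHA model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical container for the shape parameters used below."""
    num_layers: int          # L, transformer depth
    hidden_size: int         # d, embedding width
    num_kv_heads: int        # equals the attention head count for MHA, fewer for GQA
    head_dim: int            # per-head dimension
    bytes_per_elem: int = 2  # assumption: FP16/BF16 storage


def kv_cache_bytes(m: ModelShape, s_ctx: int) -> int:
    """Bytes moved when the KV cache for s_ctx tokens crosses devices.

    The factor 2 accounts for one key and one value tensor per layer.
    """
    return 2 * m.num_layers * s_ctx * m.num_kv_heads * m.head_dim * m.bytes_per_elem


def embedding_bytes(m: ModelShape, n_vision_tokens: int) -> int:
    """Bytes moved when only the vision embeddings cross devices."""
    return n_vision_tokens * m.hidden_size * m.bytes_per_elem


def transfer_ratio(m: ModelShape, s_ctx: int, n_vision_tokens: int) -> float:
    """R = D_KV / D_emb under the assumptions above."""
    return kv_cache_bytes(m, s_ctx) / embedding_bytes(m, n_vision_tokens)


if __name__ == "__main__":
    shape = ModelShape(num_layers=32, hidden_size=4096, num_kv_heads=32, head_dim=128)
    d_kv = kv_cache_bytes(shape, s_ctx=1024)
    d_emb = embedding_bytes(shape, n_vision_tokens=576)
    print(f"KV cache:   {d_kv / 2**20:.1f} MiB")
    print(f"Embeddings: {d_emb / 2**20:.1f} MiB")
    print(f"R = {transfer_ratio(shape, 1024, 576):.1f}x")
```

With these assumed values the KV cache is 512 MiB and the embeddings are 4.5 MiB, a ratio of roughly 114×. For MHA with `num_kv_heads * head_dim == hidden_size`, the ratio simplifies to `2 * L * s_ctx / N_v`, which makes the linear growth in depth explicit. The ratios reported in the paper depend on its own context lengths and architectures, so this sketch is not expected to reproduce them exactly.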

Experimental Results
Research Questions
- RQ1: Does partitioning at the modality boundary minimize cross-device transfer under standard KV caching across MLLMs?
- RQ2: Can cross-tier heterogeneous deployment offer practical cost and throughput benefits for multimodal inference on commodity interconnects?
- RQ3: What are the theoretical and empirical gains in transfer efficiency and economic cost when using modality-level disaggregation?
- RQ4: Is a phase-aware runtime like HeteroServe feasible and beneficial on real hardware with dynamic vision tokens and varied attention mechanisms?
Key Findings
- Modality-level disaggregation reduces cross-device transfer from GB-scale KV caches to MB-scale embeddings, an $O(L)$ reduction, with transfer ratios $R$ of 78×–196× in the reported architectures (MHA/GQA).
- A closed-form cost model predicts 31.4% savings for heterogeneous deployment, with observed savings of 40.6%.
- Engine optimizations in HeteroServe yield up to 54% higher throughput than the vLLM baseline on identical 4×A100 hardware.
- Under a fixed budget, a heterogeneous cluster ($38k) delivers 37% more Tokens/$ than a homogeneous baseline ($64k) without latency degradation.
- Empirical evaluation on LLaVA-1.5-7B and Qwen2.5-VL demonstrates practical viability of modality-level cross-tier serving with PCIe.
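
The budget figures above can be related with a few lines of arithmetic. The sketch below assumes that Tokens/$ means throughput divided by cluster price, which the digest does not state explicitly; the function names are hypothetical. Under that assumption, the price gap between the two clusters gives the hardware savings, and the Tokens/$ gain implies how much raw throughput the cheaper cluster retains.

```python
def hardware_savings(hetero_cost: float, homo_cost: float) -> float:
    """Fraction of the homogeneous budget saved by the heterogeneous cluster."""
    return 1.0 - hetero_cost / homo_cost


def implied_throughput_ratio(tokens_per_dollar_gain: float,
                             hetero_cost: float,
                             homo_cost: float) -> float:
    """Heterogeneous throughput relative to homogeneous throughput.

    Assumes Tokens/$ = throughput / cluster price, so
    T_het / T_hom = (1 + gain) * (C_het / C_hom).
    """
    return (1.0 + tokens_per_dollar_gain) * hetero_cost / homo_cost


if __name__ == "__main__":
    hetero, homo = 38_000.0, 64_000.0  # cluster prices quoted in the findings
    gain = 0.37                        # reported Tokens/$ improvement
    print(f"Hardware savings: {hardware_savings(hetero, homo):.1%}")
    print(f"Implied throughput ratio: {implied_throughput_ratio(gain, hetero, homo):.3f}")
```

The savings term evaluates to 1 − 38/64 ≈ 40.6%, which matches the observed savings figure. The implied throughput ratio of about 0.81 is an inference from the stated assumption, not a number reported in the digest.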

This digest was generated by AI and reviewed by human editors.