QUICK REVIEW

[Paper Review] INFaaS: A Model-less Inference Serving System

Francisco Romero, Qian Li | arXiv (Cornell University) | May 30, 2019

Advanced Neural Network Applications | References: 15 | Cited by: 7

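One-Sentence Summary

INFaaS is a model-less inference-as-a-service system that automatically handles resource and configuration decisions for machine learning inference workloads. By dynamically selecting the model variant, hardware, and scaling strategy that best fit user-specified performance and accuracy requirements, INFaaS achieves up to 150x cost savings, 1.5x higher throughput, and 1.5x fewer latency-objective violations compared to Clipper and TensorFlow Serving.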

ABSTRACT

Despite existing work in machine learning inference serving, ease-of-use and cost efficiency remain key challenges. Developers must manually match the performance, accuracy, and cost constraints of their applications to decisions about selecting the right model and model optimizations, suitable hardware architectures, and auto-scaling configurations. These interacting decisions are difficult to make for users, especially when the application load varies, applications evolve, and the available resources vary over time. Thus, users often end up making decisions that overprovision resources. This paper introduces INFaaS, a model-less inference-as-a-service system that relieves users of making these decisions. INFaaS provides a simple interface allowing users to specify their inference task, and performance and accuracy requirements. To implement this interface, INFaaS generates and leverages model-variants, versions of a model that differ in resource footprints, latencies, costs, and accuracies. Based on the characteristics of the model-variants, INFaaS automatically navigates the decision space on behalf of users to meet user-specified objectives: (a) it selects a model, hardware architecture, and any compiler optimizations, and (b) it makes scaling and resource allocation decisions. By sharing models across users and hardware resources across models, INFaaS achieves up to 150x cost savings, 1.5x higher throughput, and violates latency objectives 1.5x less frequently, compared to Clipper and TensorFlow Serving.

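Motivation and Objectives

Eliminate the manual, error-prone configuration of models, hardware, and auto-scaling in machine learning inference serving.
Reduce the resource overprovisioning caused by complex, interdependent decisions under dynamic workloads.
Let users specify only their inference task and the performance and accuracy constraints they need.
Automatically explore the decision space of model selection, hardware, compiler optimizations, and scaling strategies.
Substantially lower resource cost and improve performance by sharing models across users and hardware resources across models.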

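Proposed Method

Generate model-variants: versions of the same model that differ in resource footprint, latency, cost, and accuracy.
Use a centralized system to analyze the characteristics of the model-variants and map them to hardware and optimization configurations.
Automatically select the best combination of model, hardware, and compiler optimizations for the user-specified objectives.
Dynamically manage auto-scaling and resource allocation decisions as the workload changes.
Improve resource utilization and lower total cost by sharing models across users and hardware resources across models.
Integrate with existing inference serving stacks and expose a simple interface that requires no model-specific configuration from users.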
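
The selection step described above can be sketched in a few lines of Python. This is only an illustration of the idea of picking the cheapest model-variant that meets a latency and accuracy requirement; it is not INFaaS's actual algorithm or API. The names `ModelVariant`, `select_variant`, and `VARIANTS`, the cheapest-feasible policy, and every number in the table are assumptions made up for this example, not measurements from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelVariant:
    """Hypothetical profile of one model-variant (all fields are assumptions)."""
    name: str             # illustrative variant label
    hardware: str         # e.g. "cpu" or "gpu"
    latency_ms: float     # profiled per-request latency
    accuracy: float       # accuracy in [0, 1]
    cost_per_hour: float  # hourly cost of the hardware it runs on


def select_variant(
    variants: Sequence[ModelVariant],
    max_latency_ms: float,
    min_accuracy: float,
) -> Optional[ModelVariant]:
    """Return the cheapest variant meeting both requirements, or None."""
    feasible = [
        v for v in variants
        if v.latency_ms <= max_latency_ms and v.accuracy >= min_accuracy
    ]
    if not feasible:
        return None
    # Assumed policy: minimize cost first, break ties by lower latency.
    return min(feasible, key=lambda v: (v.cost_per_hour, v.latency_ms))


# Made-up profiles for illustration only.
VARIANTS = [
    ModelVariant("resnet50-fp32", "cpu", 180.0, 0.76, 0.10),
    ModelVariant("resnet50-fp16", "gpu", 12.0, 0.76, 0.90),
    ModelVariant("resnet18-int8", "cpu", 40.0, 0.69, 0.10),
]

if __name__ == "__main__":
    choice = select_variant(VARIANTS, max_latency_ms=50.0, min_accuracy=0.75)
    print(choice.name if choice else "no feasible variant")
```

With these made-up profiles, a tight latency bound combined with a high accuracy floor forces the more expensive GPU variant, while relaxing either requirement lets a cheaper CPU variant serve the request. That is the kind of trade-off the user no longer has to reason about by hand.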

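Experimental Results

Research Questions

RQ1: How can the decision space of model selection, hardware, compiler optimizations, and scaling strategies be automated to reduce manual effort in machine learning inference serving?
RQ2: To what extent can model-variants improve cost efficiency and performance in a shared inference serving environment?
RQ3: Compared with existing systems such as Clipper and TensorFlow Serving, can automatic configuration selection reduce latency violations and resource overprovisioning?
RQ4: What trade-offs between performance and cost arise when models and hardware are shared across multiple users and workloads?
RQ5: How does the system maintain accuracy and latency guarantees under dynamic workloads and changing user requirements?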

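Key Findings

By sharing models and hardware resources across users, INFaaS achieves up to 150x cost savings compared to Clipper and TensorFlow Serving.
Owing to better resource utilization and configuration choices, INFaaS achieves 1.5x higher throughput than the baseline systems.
INFaaS violates latency objectives 1.5x less frequently than Clipper and TensorFlow Serving.
By automatically selecting the best model-variant and configuration, the system effectively reduces resource overprovisioning.
Model-variant generation and runtime decision-making significantly improve performance and cost efficiency without requiring users to understand low-level optimizations.
By dynamically adapting to changes in resources and workload, the system maintains high accuracy and low latency across diverse workloads.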
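
To make the idea of adapting to workload changes concrete, the sketch below computes how many replicas of a variant would be needed for a given request rate. It is a generic throughput-based scaling rule offered as an assumption, not the autoscaler described in the paper; the function name `replicas_needed`, the `headroom` parameter, and its default value are all hypothetical.

```python
import math


def replicas_needed(load_qps: float, per_replica_qps: float,
                    headroom: float = 0.25) -> int:
    """Replicas required to serve load_qps with spare capacity.

    headroom is an assumed safety margin: 0.25 provisions for 25% more
    load than currently observed, so short bursts do not violate latency.
    """
    if per_replica_qps <= 0:
        raise ValueError("per_replica_qps must be positive")
    if load_qps <= 0:
        return 0  # nothing to serve, so scale to zero
    return math.ceil(load_qps * (1.0 + headroom) / per_replica_qps)


if __name__ == "__main__":
    # Re-evaluate as the observed load changes over time.
    for qps in (0, 30, 100, 400):
        print(qps, replicas_needed(qps, per_replica_qps=40.0))
```

Re-running such a rule as load rises and falls is what keeps provisioning close to demand instead of sized for the peak, which is the overprovisioning the findings above refer to.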

This review was generated by AI and reviewed by human editors.