QUICK REVIEW

[Paper Review] TensorFlow-Serving: Flexible, High-Performance ML Serving

Christopher Olston, Noah Fiedel | arXiv (Cornell University) | Dec 17, 2017

Machine Learning and Data Classification | References: 7 | Cited by: 95

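One-Sentence Summary

TensorFlow-Serving is a flexible, high-performance ML serving framework comprising a library, a canonical binary, and a hosted service (TFS^2). It provides efficient model lifecycle management, batching, and multi-model hosting inside Google and on Google Cloud.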

ABSTRACT

We describe TensorFlow-Serving, a system to serve machine learning models inside Google which is also available in the cloud and via open-source. It is extremely flexible in terms of the types of ML platforms it supports, and ways to integrate with systems that convey new models and updated versions from training to serving. At the same time, the core code paths around model lookup and inference have been carefully optimized to avoid performance pitfalls observed in naive implementations. Google uses it in many production deployments, including a multi-tenant model hosting service called TFS^2.

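Motivation and Objectives

Motivate the need for production-ready ML model-serving infrastructure.
Describe an architecture that supports multiple ML platforms and model lifecycles.
Present mechanisms for safe model upgrades, canarying, and rollback, along with efficient memory management.
Explain the hosted service (TFS^2) and how it automates model deployment and routing.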

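Proposed Method

Describes a three-layer design: a C++ library, a canonical server binary, and a hosted service.
Uses Sources, Source Routers, Source Adapters, and a Manager to implement model lifecycle management through an aspired-versions API.
Introduces AspiredVersionsManager, with availability-preserving or resource-preserving transition policies and tail-latency optimizations.
Provides several inference APIs, including a low-level tensor interface and higher-level APIs based on tf.Example, plus logging for debugging and quality checks.
Develops cross-request batching, with a core batching library that supports multiple queues and dynamic serving of models and versions.
Offers a canonical binary deployment and a hosted service (TFS^2) to make adoption easier and enforce best practices.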
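
The aspired-versions idea and the two transition policies can be illustrated with a small sketch. This is not the TensorFlow-Serving C++ API: the class `ToyAspiredVersionsManager`, the `TransitionPolicy` enum, and all method names are hypothetical stand-ins. The sketch assumes that "availability-preserving" means loading the new version before unloading the old one, and that "resource-preserving" means the reverse order.

```python
from enum import Enum


class TransitionPolicy(Enum):
    AVAILABILITY_PRESERVING = "availability"  # load new before unloading old
    RESOURCE_PRESERVING = "resource"          # unload old before loading new


class ToyAspiredVersionsManager:
    """Hypothetical, in-memory stand-in for a version manager."""

    def __init__(self, policy):
        self.policy = policy
        self.loaded = {}   # model name -> set of loaded version numbers
        self.events = []   # ordered log of (action, model, version)

    def _load(self, model, version):
        self.loaded.setdefault(model, set()).add(version)
        self.events.append(("load", model, version))

    def _unload(self, model, version):
        self.loaded[model].discard(version)
        self.events.append(("unload", model, version))

    def set_aspired_versions(self, model, aspired):
        """Reconcile the loaded versions of one model with the aspired set."""
        current = set(self.loaded.get(model, set()))
        to_load = sorted(set(aspired) - current)
        to_unload = sorted(current - set(aspired))
        if self.policy is TransitionPolicy.AVAILABILITY_PRESERVING:
            # Some version stays servable at all times, at the cost of
            # briefly holding both versions in memory.
            for v in to_load:
                self._load(model, v)
            for v in to_unload:
                self._unload(model, v)
        else:
            # Peak memory stays low, at the cost of a window with no
            # servable version.
            for v in to_unload:
                self._unload(model, v)
            for v in to_load:
                self._load(model, v)

    def lookup(self, model):
        """Return the highest loaded version, or None if nothing is loaded."""
        versions = self.loaded.get(model)
        return max(versions) if versions else None
```

In this sketch a canary amounts to aspiring both the old and the new version at once, and a rollback amounts to aspiring only the old version again.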

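Experimental Results

Research Questions

RQ1: How can a general-purpose ML model-serving system be designed to be agnostic to the underlying ML framework?
RQ2: How should model versions be loaded, switched, and rolled back while minimizing latency and allowing safe canary testing?
RQ3: Which batching and threading strategies achieve high-throughput GPU/TPU inference while keeping tail latency low?
RQ4: How can serving infrastructure be offered as a hosted service with automated routing and resource management?
RQ5: What mechanisms ensure that the end-to-end ML pipeline is quality-checked before a new version is served?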
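
For RQ3, the core idea of cross-request batching can be shown in a few lines. This is a simplified sketch, not the TensorFlow-Serving batching library: `ToyBatcher` and its fields are hypothetical, it uses a single queue rather than the multiple queues the paper describes, and it flushes only on batch size, whereas a real server would presumably also flush on a timeout to bound latency.

```python
class ToyBatcher:
    """Collects single requests and runs them through the model in batches."""

    def __init__(self, model_fn, max_batch_size):
        self.model_fn = model_fn            # list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.pending = []                   # (request_id, input) awaiting a batch
        self.results = {}                   # request_id -> output
        self.batch_sizes = []               # size of each executed batch

    def submit(self, request_id, x):
        """Enqueue one request; run a batch as soon as the queue is full."""
        self.pending.append((request_id, x))
        if len(self.pending) >= self.max_batch_size:
            self.flush()

    def flush(self):
        """Run every pending request as one batch and fan the outputs back out."""
        if not self.pending:
            return
        ids, inputs = zip(*self.pending)
        outputs = self.model_fn(list(inputs))
        self.results.update(zip(ids, outputs))
        self.batch_sizes.append(len(inputs))
        self.pending = []


def double_all(xs):
    """Stand-in for a batched model call."""
    return [2 * x for x in xs]


batcher = ToyBatcher(double_all, max_batch_size=3)
for i in range(5):
    batcher.submit(i, i)
batcher.flush()  # drain the partial final batch
```

Five requests with a maximum batch size of three result in two model invocations instead of five, which is where the throughput gain on accelerators comes from.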

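Key Findings

The library, binary, and hosted service together cover a range of deployment needs, including multi-model and multi-tenant environments.
AspiredVersionsManager enables canary and rollback workflows that validate a new version safely before full rollout.
TensorFlow-Serving can handle roughly 100,000 requests per second per core when the RPC and TensorFlow layers are excluded from the measurement.
Adoption at Google spans hundreds of projects and tens of millions of inferences per second across many users.
TFS^2 automates the assignment of models to serving jobs as well as canarying and rollback, relies on Spanner for global state, and uses hedged requests to mitigate latency spikes.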
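
The hedged-request technique mentioned in the last finding can be sketched as follows. This is an illustration under assumptions, not TFS^2 code: `hedged_call`, the replica callables, and the delay value are all hypothetical. The request goes to a primary replica first; if no reply arrives within a short delay, a backup request goes to a second replica and whichever reply arrives first is used.

```python
import concurrent.futures as cf
import time


def hedged_call(replicas, request, hedge_delay_s):
    """Call replicas[0]; if it is still pending after hedge_delay_s,
    also call replicas[1] and return whichever result arrives first."""
    pool = cf.ThreadPoolExecutor(max_workers=2)
    try:
        futures = [pool.submit(replicas[0], request)]
        done, _ = cf.wait(futures, timeout=hedge_delay_s)
        if not done:
            # Primary is slow: send the backup (hedge) request.
            futures.append(pool.submit(replicas[1], request))
            done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        # Do not block on the slower request; its result is discarded.
        pool.shutdown(wait=False)


def slow_replica(request):
    """Simulates a replica hit by a latency spike."""
    time.sleep(0.5)
    return ("slow", request)


def fast_replica(request):
    """Simulates a healthy replica."""
    return ("fast", request)
```

The trade-off is extra load: every hedged request costs a second inference, so the delay is usually chosen so that only a small fraction of requests trigger the backup.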


This review was generated by AI and reviewed by a human editor.