QUICK REVIEW

[Paper Review] Reusable MLOps: Reusable Deployment, Reusable Infrastructure and Hot-Swappable Machine Learning models and services

Deven Panchal, Priyanka Verma | arXiv (Cornell University) | Feb 19, 2024

Distributed and Parallel Computing Systems | Cited by 5

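ONE-SENTENCE SUMMARY

Through the Acumos AI platform, this paper proposes Reusable MLOps: reusable deployment, reusable infrastructure, and hot-swappable ML models, so that new models can be continuously pushed to production without tearing down the infrastructure.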

ABSTRACT

Although Machine Learning model building has become increasingly accessible due to a plethora of tools, libraries and algorithms being available freely, easy operationalization of these models is still a problem. It requires considerable expertise in data engineering, software development, cloud and DevOps. It also requires planning, agreement, and vision of how the model is going to be used by the business applications once it is in production, how it is going to be continuously trained on fresh incoming data, and how and when a newer model would replace an existing model. This leads to developers and data scientists working in silos and making suboptimal decisions. It also leads to wasted time and effort. We introduce the Acumos AI platform we developed and we demonstrate some unique novel capabilities that the Acumos model runner possesses, that can help solve the above problems. We introduce a new sustainable concept in the field of AI/ML operations - called Reusable MLOps - where we reuse the existing deployment and infrastructure to serve new models by hot-swapping them without tearing down the infrastructure or the microservice, thus achieving reusable deployment and operations for AI/ML models while still having continuously trained models in production.

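RESEARCH MOTIVATION AND GOALS

Motivates the work with the operationalization challenges in ML/AI pipelines and the silos between data scientists and developers.
Introduces the concept of Reusable MLOps, which reuses an existing deployment and its infrastructure to serve multiple models.
Demonstrates how Acumos enables hot-swapping and continuous retraining of production models without downtime.
Highlights the governance, licensing, and marketplace aspects that promote model sharing and reuse.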

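PROPOSED METHOD

Describes the Acumos platform and its components (Model Runner, Java client, Design Studio) for onboarding and deploying models.
Explains how models are exported (MOJO zip, jar) and wrapped by the Model Runner into a generic Acumos microservice.
Shows how Protobuf serialization provides low-latency, language-agnostic data exchange, together with dynamic class loading on the JVM.
Details the API surface of the Acumos Model Runner and how its endpoints enable model replacement, proto updates, and on-demand behavior changes.
Covers deployment paths to Kubernetes, AWS, Azure, GCP, or standalone Docker, as well as model reuse across services.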
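
The hot-swap idea summarized above can be illustrated with a minimal Python sketch. It is not the Acumos Model Runner or its API: the class `HotSwappableRunner`, its methods `predict` and `swap_model`, and the two toy models are hypothetical names chosen for this illustration. The sketch only shows the general pattern of keeping a long-lived serving object alive while the model it serves is replaced in place.

```python
import threading
from typing import Callable, Sequence, Tuple

# A "model" here is any callable mapping a feature vector to a prediction.
Model = Callable[[Sequence[float]], float]


class HotSwappableRunner:
    """Hypothetical stand-in for a long-lived model-serving microservice."""

    def __init__(self, model: Model, version: str) -> None:
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def predict(self, features: Sequence[float]) -> Tuple[str, float]:
        # Snapshot the current model under the lock, then run inference
        # outside it so a slow prediction never blocks a swap.
        with self._lock:
            model, version = self._model, self._version
        return version, model(features)

    def swap_model(self, new_model: Model, new_version: str) -> str:
        # Replace the served model in place. The runner object (the
        # "deployment") is reused and never restarted.
        with self._lock:
            old_version = self._version
            self._model, self._version = new_model, new_version
        return old_version


def model_v1(features: Sequence[float]) -> float:
    # Toy model: sum of the features.
    return sum(features)


def model_v2(features: Sequence[float]) -> float:
    # Toy "retrained" model: mean of the features.
    return sum(features) / len(features)


runner = HotSwappableRunner(model_v1, "v1")
before = runner.predict([1.0, 2.0, 3.0])
replaced = runner.swap_model(model_v2, "v2")
after = runner.predict([1.0, 2.0, 3.0])
```

Callers keep using the same `runner` object before and after the swap, which mirrors the claim that the deployment and infrastructure stay in place while only the model changes.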

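EXPERIMENTAL RESULTS

Research Questions

RQ1: How can model deployments and infrastructure be reused across multiple ML models without tearing down the service?
RQ2: What capabilities does the Acumos Model Runner provide for hot-swapping models or changing model behavior in production?
RQ3: How can an ML pipeline support continuous training and seamless model replacement in a production environment?
RQ4: Within a Reusable MLOps framework, which governance, licensing, and federation mechanisms support model sharing and monetization?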

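Key Findings

The Acumos Model Runner makes it possible to hot-swap models inside an existing microservice without incurring downtime.
Models are wrapped in a generic Acumos microservice and can be updated or replaced through a rich API surface.
Protobuf-based serialization supports low-latency, cross-language data exchange, alongside dynamic class loading of model artifacts.
The platform supports deploying hosted models to common cloud and container runtimes, promoting reusable deployment and infrastructure reuse.
The Design Studio and Java client simplify onboarding models from H2O, Java, or Spark into production-ready microservices.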
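
The findings mention dynamic class loading of model artifacts on the JVM. As a rough analogue only, the sketch below loads model code from a file at runtime using Python's standard `importlib` machinery. It does not reproduce Acumos or Java class loading: the helper `load_model_from_file`, the artifact file names, and the convention that each artifact exposes a `predict` function are assumptions made for this illustration.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_model_from_file(path: Path, module_name: str):
    """Load a Python source file at runtime and return its `predict` function.

    Hypothetical helper: assumes the artifact defines a top-level `predict`.
    """
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.predict


with tempfile.TemporaryDirectory() as tmp:
    # Two toy "model artifacts" written to disk, standing in for uploaded models.
    artifact_a = Path(tmp) / "model_a.py"
    artifact_a.write_text("def predict(x):\n    return 2 * x\n")
    artifact_b = Path(tmp) / "model_b.py"
    artifact_b.write_text("def predict(x):\n    return x + 10\n")

    # The running process loads the first artifact, serves a request,
    # then loads the second artifact without restarting.
    active = load_model_from_file(artifact_a, "model_a")
    first = active(3)
    active = load_model_from_file(artifact_b, "model_b")
    second = active(3)
```

The point of the sketch is that new model code can be brought into an already running process from an artifact supplied later, which is the property that makes replacing a model without a restart possible.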

Figure 2: A Machine Learning model being onboarded to Acumos

This review was generated by AI and reviewed by human editors.