QUICK REVIEW

[Paper Review] Who Needs MLOps: What Data Scientists Seek to Accomplish and How Can MLOps Help?

Sasu Mäkinen, Henrik Skogström | arXiv (Cornell University) | March 16, 2021

Semantic Web and Ontologies | References: 10 | Citations: 160

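One-Line Summary

This paper surveys 331 ML professionals from 63 countries to assess how data scientists' day-to-day activities relate to MLOps. Data issues dominate the early stages, while the benefits of MLOps emerge in organizations that retrain frequently and deploy models to production.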

ABSTRACT

Following continuous software engineering practices, there has been an increasing interest in rapid deployment of machine learning (ML) features, called MLOps. In this paper, we study the importance of MLOps in the context of data scientists' daily activities, based on a survey where we collected responses from 331 professionals from 63 different countries in ML domain, indicating on what they were working on in the last three months. Based on the results, up to 40% respondents say that they work with both models and infrastructure; the majority of the work revolves around relational and time series data; and the largest categories of problems to be solved are predictive analysis, time series data, and computer vision. The biggest perceived problems revolve around data, although there is some awareness of problems related to deploying models to production and related procedures. To hypothesise, we believe that organisations represented in the survey can be divided to three categories -- (i) figuring out how to best use data; (ii) focusing on building the first models and getting them to production; and (iii) managing several models, their versions and training datasets, as well as retraining and frequent deployment of retrained models. In the results, the majority of respondents are in category (i) or (ii), focusing on data and models; however the benefits of MLOps only emerge in category (iii) when there is a need for frequent retraining and redeployment. Hence, setting up an MLOps pipeline is a natural step to take, when an organization takes the step from ML as a proof-of-concept to ML as a part of nominal activities.

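Motivation and Objectives

Assess the role MLOps plays in the daily activities of data scientists in real-world organizations.
Characterize the stages of ML maturity and explain how they map to MLOps adoption.
Identify the data types, problems, and deployment practices most relevant to ML operations.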

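Proposed Method

Conduct an online survey (State of ML 2020) of 331 ML professionals from 63 countries.
Ask about current work focus, data types, problems being solved, and short-term goals (three months).
Assess perceived obstacles and classify organizations by ML maturity as data-centric, model-centric, or pipeline-centric.
Examine how ML maturity relates to the need for MLOps.
Offer insights into the transition from proof of concept to production-grade ML operations.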
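
The three-stage classification above can be illustrated with a minimal Python sketch. This is not the authors' procedure: the paper proposes the three categories as a hypothesis drawn from the survey answers, and the field names (`has_trained_model`, `models_in_production`, `retrains_regularly`), the threshold of two production models, and the sample responses below are all assumptions made purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical survey answers; these fields are illustrative, not from the survey."""
    has_trained_model: bool      # has the team built at least one model?
    models_in_production: int    # number of models currently deployed
    retrains_regularly: bool     # are deployed models retrained and redeployed often?


def maturity_stage(r: Response) -> str:
    """Map one response to a maturity stage using assumed, illustrative rules."""
    # Assumed rule: several deployed models plus regular retraining.
    if r.models_in_production >= 2 and r.retrains_regularly:
        return "pipeline-centric"
    # Assumed rule: building first models or getting one into production.
    if r.has_trained_model or r.models_in_production >= 1:
        return "model-centric"
    # Otherwise the team is still working out how to use its data.
    return "data-centric"


# Made-up sample responses, not survey data.
responses = [
    Response(False, 0, False),
    Response(True, 0, False),
    Response(True, 1, False),
    Response(True, 3, True),
]

stage_counts = Counter(maturity_stage(r) for r in responses)
```

Tallying `stage_counts` over a real response set would show how respondents distribute across the stages; where the boundaries are drawn is a design choice that would need to be justified against the actual questionnaire.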

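Experimental Results

Research Questions

RQ1: What do data scientists work on in their day-to-day ML activities?
RQ2: How do data types and problem domains relate to the need for MLOps?
RQ3: How does ML maturity differ across organizations, and how does it affect MLOps adoption and benefits?
RQ4: What are the main obstacles, and when does MLOps deliver the most value?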

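Key Findings

Up to 40% of respondents work with both models and infrastructure.
Most of the work revolves around relational and time series data.
The largest problem categories are predictive analysis, time series data, and computer vision.
Data-related issues (messy data, scarcity, accessibility) are the biggest challenges; deployment-related issues are not prominent outside mature, pipeline-centric contexts.
Most respondents are in the data-centric or model-centric stage; the pipeline-centric stage (frequent retraining and deployment) is less common, but it is where MLOps adds clear value.
Pioneering, pipeline-centric organizations tend to be larger and to include teams that handle both infrastructure and modeling problems.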

This review was generated by AI and reviewed by a human editor.