QUICK REVIEW

[Paper Review] PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services

Zheming Yang, Yuanhao Yang | arXiv (Cornell University) | May 23, 2024

Privacy-Preserving Technologies in Data | Citations: 6

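One-Line Summary

PerLLM is an edge-cloud inference scheduling framework for dynamic resource conditions. It uses a constraint-satisfaction upper confidence bound (CS-UCB) approach to raise LLM service throughput while minimizing energy cost.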

ABSTRACT

With the rapid growth in the number of large language model (LLM) users, it is difficult for bandwidth-constrained cloud servers to simultaneously process massive LLM services in real-time. Recently, edge-cloud infrastructures have been used to improve the processing efficiency of large-scale LLM services. However, the diversity of task requirements and the dynamics of resources pose great challenges to inference scheduling, leading to the wastage of many resources. In this paper, we present PerLLM, a personalized inference scheduling framework with edge-cloud collaboration designed for diverse LLM services. For the complexity of multiple constraints and the decision-making process of edge-cloud collaboration, we integrate the upper confidence bound algorithm based on the constraint satisfaction mechanism in PerLLM. For diverse LLM services, PerLLM can optimize service scheduling and resource allocation solutions within the edge-cloud infrastructure to meet processing time requirements while minimizing energy costs. Experimental results from different model deployments show that PerLLM can effectively meet the processing time requirements of personalized services. Compared to other methods, PerLLM achieves 2.2x, 2.1x, and 1.6x throughput and reduces the energy cost by more than 50%.

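Motivation and Goals

Enable efficient real-time LLM inference despite bandwidth-constrained cloud servers.
Use edge-cloud collaboration to balance fast response against inference quality across diverse LLM tasks.
Develop a scheduling and resource allocation framework that adapts to dynamic resources and service requirements.
Formalize the problem as a combinatorial multi-armed bandit with constraint satisfaction.
Provide an algorithm and theoretical analysis for learning-based, constraint-aware scheduling.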

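Proposed Method

Formulates the assignment of multiple services to servers (edge or cloud) as a combinatorial multi-armed bandit (CMAB) problem.
Introduces a constraint satisfaction mechanism that encodes processing-time, bandwidth, and compute-capacity constraints into the optimization.
Defines the CS-UCB algorithm, which filters out infeasible actions and then selects the action with the highest upper confidence bound (UCB) value, balancing exploration and exploitation.
Introduces a reward function that penalizes constraint violations through a penalty term and accounts for energy costs such as transmission, inference, and waiting.
Provides a theoretical regret and complexity analysis of CS-UCB under constraint violations and dynamic environments.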
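
The filter-then-select step and the penalized reward described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Server` and `Service` fields, their units, the constant-power energy model, the `penalty` weight, and the exploration constant `c` are all hypothetical. It also schedules one service at a time, whereas the paper's CMAB formulation allocates several services jointly, and it uses nominal resource values where a real system would rely on estimates. Rewards are not normalized here, so the size of the exploration bonus relative to the reward is arbitrary.

```python
import math
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical edge or cloud server (all fields are assumptions)."""
    name: str
    bandwidth: float  # available bandwidth, Mbit/s
    compute: float    # inference speed, tokens/s
    power: float      # average power draw while serving, watts


@dataclass
class Service:
    """Hypothetical LLM service request."""
    payload: float    # data to transmit, Mbit
    tokens: float     # tokens to generate
    deadline: float   # processing-time requirement, seconds


def processing_time(svc: Service, srv: Server) -> float:
    """Transmission time plus inference time (queueing is ignored)."""
    return svc.payload / srv.bandwidth + svc.tokens / srv.compute


def reward(svc: Service, srv: Server, penalty: float = 10.0) -> float:
    """Negative energy cost, minus a penalty proportional to deadline overrun."""
    t = processing_time(svc, srv)
    energy = srv.power * t                    # joules, constant-power assumption
    violation = max(0.0, t - svc.deadline)    # seconds past the deadline
    return -energy - penalty * violation


class CSUCB:
    """Constraint-filtered UCB over servers, one service per decision."""

    def __init__(self, servers, c: float = 2.0):
        self.servers = servers
        self.c = c
        self.counts = [0] * len(servers)   # times each server was chosen
        self.means = [0.0] * len(servers)  # running mean reward per server
        self.t = 0                         # total decisions so far

    def feasible(self, svc: Service):
        """Indices of servers whose processing time meets the deadline."""
        return [i for i, s in enumerate(self.servers)
                if processing_time(svc, s) <= svc.deadline]

    def select(self, svc: Service) -> int:
        """Pick the feasible server with the highest UCB index."""
        self.t += 1
        # Fall back to all servers when no server satisfies the constraint.
        candidates = self.feasible(svc) or list(range(len(self.servers)))

        def ucb(i: int) -> float:
            if self.counts[i] == 0:
                return math.inf  # try every server at least once
            bonus = math.sqrt(self.c * math.log(self.t) / self.counts[i])
            return self.means[i] + bonus

        return max(candidates, key=ucb)

    def update(self, i: int, r: float) -> None:
        """Incrementally update the mean reward of server i."""
        self.counts[i] += 1
        self.means[i] += (r - self.means[i]) / self.counts[i]


servers = [Server("edge", bandwidth=100.0, compute=50.0, power=30.0),
           Server("cloud", bandwidth=20.0, compute=400.0, power=300.0)]
scheduler = CSUCB(servers)
request = Service(payload=10.0, tokens=200.0, deadline=2.0)
choice = scheduler.select(request)
scheduler.update(choice, reward(request, servers[choice]))
print(servers[choice].name)
```

With these made-up numbers the edge server would need about 4.1 s for the request, so only the cloud server passes the feasibility filter. Filtering before computing the UCB index keeps exploration inside the set of actions that respect the deadline, while the penalty term in the reward still discourages violations in the fallback case where no server is feasible.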

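Experimental Results

Research Questions

RQ1: How can edge-cloud collaboration be optimized to meet diverse LLM service requirements under dynamic resource conditions?
RQ2: Can a constraint-aware CMAB framework (CS-UCB) effectively perform service scheduling and resource allocation to maximize throughput and minimize energy?
RQ3: What theoretical guarantees (regret bounds) hold for CS-UCB in this constrained CMAB setting?
RQ4: How does PerLLM perform against the baselines in processing-time satisfaction, throughput, and energy cost across different bandwidths?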

Key Results

Model	FineInfer	AGOD	RewardlessGuidance	PerLLM
Yi-6B	58%	67%	74%	98%
LLaMA2-7B	58%	69%	77%	99%
LLaMA3-8B	58%	66%	74%	98%
Yi-9B	58%	66%	71%	97%

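Across model deployments and dynamic bandwidth, PerLLM meets processing-time requirements with a success rate of 97% or higher.
PerLLM achieves 1.6x to 2.2x the throughput of the baselines (FineInfer, AGOD, RewardlessGuidance).
PerLLM reduces energy cost by more than 50% compared with the baselines.
The experiments show that allocating resources dynamically to satisfy service requirements yields higher processing efficiency.
CS-UCB effectively learns and adapts to resource dynamics while respecting the constraints.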

This review was generated by AI and reviewed by a human editor.