QUICK REVIEW

[論文レビュー] Chat AI: A Seamless Slurm-Native Solution for HPC-Based Services

Ali Doosthosseini, Jonathan Decker|arXiv (Cornell University)|Jun 27, 2024

Distributed and Parallel Computing Systems被引用数 5

ひとこと要約

論文は、SSH制限付き HPC プロキシ、API ゲートウェイ、スケジューラを密接に統合し、ウェブインターフェースを Slurm ネイティブのウェブサービスアーキテクチャとして実現することで、HPCクラスター上で私的・リアルタイム LLM 推論を可能にする。LLMsを通常の Slurm ワークロードと並行して実行する。

ABSTRACT

The widespread adoption of large language models (LLMs) has created a pressing need for an efficient, secure and private serving infrastructure, which allows researchers to run open source or custom fine-tuned LLMs and ensures users that their data remains private and is not stored without their consent. While high-performance computing (HPC) systems equipped with state-of-the-art GPUs are well-suited for training LLMs, their batch scheduling paradigm is not designed to support real-time serving of AI applications. Cloud systems, on the other hand, are well suited for web services but commonly lack access to the computational power of HPC clusters, especially expensive and scarce high-end GPUs, which are required for optimal inference speed. We propose an architecture with an implementation consisting of a web service that runs on a cloud VM with secure access to a scalable backend running a multitude of LLM models on HPC systems. By offering a web service using our HPC infrastructure to host LLMs, we leverage the trusted environment of local universities and research centers to offer a private and secure alternative to commercial LLM services. Our solution natively integrates with the HPC batch scheduler Slurm, enabling seamless deployment on HPC clusters, and is able to run side by side with regular Slurm workloads, while utilizing gaps in the schedule created by Slurm. In order to ensure the security of the HPC system, we use the SSH ForceCommand directive to construct a robust circuit breaker, which prevents successful attacks on the web-facing server from affecting the cluster. We have successfully deployed our system as a production service, and made the source code available at \url{https://github.com/gwdg/chat-ai}

研究の動機と目的

データのプライバシーを保護し、第三者サービスを回避するための大規模言語モデルのオンプレミスでの私的なホスティングを動機づける。
リアルタイム LLM 推論をバッチワークロードと並行して実行する、Slurm 統合のシームレスなサービスアーキテクチャを設計する。
ディフェンス・イン・デプス機構とエンタープライズレベルのアクセス制御を通じて、セキュリティとデータプライバシーを確保する。

提案手法

クラウド VM 上のウェブサービスをスケーラブルな HPC バックエンドと統合し、Slurm を介して LLM をホストする。
認証とルーティングのために API ゲートウェイ（Kong）と SSO（OAuth2/OpenID）を活用する。
HPC ログインノードへ安全に接続するために SSH ForceCommand を用いた HPC プロキシを実装する。
Slurm(squeue) を照会してサービスインスタンスを維持し、需要に応じてジョブを提出する sbatch を用いたスケジューラを開発する。
OpenAI 互換の API を備えた効率的な GPU 加速推論のために vLLM を用いて LLM を実行する。
オプションとして External Proxy を介して外部モデルを公開し、Grafana/Prometheus で監視する。

Figure 1: Architecture of Chat AI. This diagram displays the main components of the service, consisting of an ESX web server that communicates to the login/service node, and the compute nodes of the HPC KISSKI platform.

実験結果

リサーチクエスチョン

RQ1Slurm ベースの HPC リソースをどのように再目的化して、リアルタイム LLM 推論を従来のバッチワークロードと並行して提供できるか？
RQ2ウェブ公開サービスが HPC クラスターを侵害するのを防ぐために、どのようなセキュリティ対策が必要か？
RQ3Slurm ネイティブな LLM 提供アーキテクチャの性能（待ち時間/スループット）とスケーラビリティはどの程度か？
RQ4提案されたアーキテクチャは GDPR 相当の規制下でのプライバシーとデータ保護にどのような影響を与えるか？
RQ5学術・研究環境での展開とユーザー導入における実務的な考慮点は何か？

主な発見

このアーキテクチャは既存の HPC インフラストラクチャ上で私的 LLm サービスの本番運用を実現している。
SSH ForceCommand と多層のディフェンス・イン・デプスによってセキュリティが強化され、クラスター侵害のリスクを低減している。
待ち時間とスループットの尺度を示す性能評価があり、リアルタイムのリクエストに対応できることを示している（詳細は抜粋には記載なし）。
スケジューラは Slurm の監視とリクエスト量に基づく動的なスケーリングによりサービスインスタンスを維持する。
ユーザー導入の考慮点とプライバシー影響について議論され、GDPR適合性と私的推論の利点を強調している。

Figure 2: Chat AI App. This shows the Chat AI web interface written with React and Vite with the chat history on the left, the prompt window on the top right and a drop down for model selection at the bottom right.

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。