QUICK REVIEW

[Paper Review] Scalable and Secure AI Inference in Healthcare: A Comparative Benchmarking of FastAPI and Triton Inference Server on Kubernetes

Ratul Ali|arXiv (Cornell University)|Jan 19, 2026

IoT and Edge/Fog Computing | Citations: 0

One-line summary

The paper benchmarks FastAPI and Triton Inference Server on Kubernetes for healthcare AI, and presents a hybrid architecture that pairs secure gateway processing with high-throughput backend inference.

ABSTRACT

Efficient and scalable deployment of machine learning (ML) models is a prerequisite for modern production environments, particularly within regulated domains such as healthcare and pharmaceuticals. In these settings, systems must balance competing requirements, including minimizing inference latency for real-time clinical decision support, maximizing throughput for batch processing of medical records, and ensuring strict adherence to data privacy standards such as HIPAA. This paper presents a rigorous benchmarking analysis comparing two prominent deployment paradigms: a lightweight, Python-based REST service using FastAPI, and a specialized, high-performance serving engine, NVIDIA Triton Inference Server. Leveraging a reference architecture for healthcare AI, we deployed a DistilBERT sentiment analysis model on Kubernetes to measure median (p50) and tail (p95) latency, as well as throughput, under controlled experimental conditions. Our results indicate a distinct trade-off. While FastAPI provides lower overhead for single-request workloads with a p50 latency of 22 ms, Triton achieves superior scalability through dynamic batching, delivering a throughput of 780 requests per second on a single NVIDIA T4 GPU, nearly double that of the baseline. Furthermore, we evaluate a hybrid architectural approach that utilizes FastAPI as a secure gateway for protected health information de-identification and Triton for backend inference. This study validates the hybrid model as a best practice for enterprise clinical AI and offers a blueprint for secure, high-availability deployments.

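Motivation and objectives

Motivate scalable, regulation-compliant AI inference for the healthcare and pharmaceutical domains.
Compare the performance of a FastAPI-based service and Triton Inference Server in a Kubernetes setting.
Evaluate a hybrid architecture that pairs FastAPI for security and PHI de-identification with Triton for GPU-based inference.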

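Proposed method

Deploy a DistilBERT sentiment analysis model on a Kubernetes reference architecture.
Compare CPU-based FastAPI inference with GPU-based Triton inference under varying load.
Enable dynamic batching in Triton and measure p50 and p95 latency and throughput.
Use a predefined model registry and health checks to support zero-downtime updates.
Assess security with a FastAPI gateway that enforces OAuth2/JWT and de-identifies PHI during preprocessing.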
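The preprocessing step in the last item can be illustrated with a minimal Python sketch of rule-based PHI masking, the kind of function a gateway handler could call before forwarding text to the inference backend. The review does not say how the paper's gateway de-identifies data, so the function name `deidentify`, the placeholder tokens, and every pattern below are hypothetical, and real de-identification needs far broader coverage than four regular expressions.

```python
import re

# Hypothetical patterns for illustration only. The source does not list the
# rules used by the gateway; these cover a few easily recognized identifiers.
PHI_PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("[DATE]", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]


def deidentify(text: str) -> str:
    """Replace each matched identifier with a fixed placeholder token."""
    for placeholder, pattern in PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    note = "Call 555-123-4567 or mail jane.doe@example.org; SSN 123-45-6789, seen 2025-03-14."
    print(deidentify(note))
```

Running the masking in the gateway rather than in the model server means the backend only ever receives placeholder tokens, which is the property the hybrid design relies on.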

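Experimental results

Research questions

RQ1: What are the latency and throughput trade-offs between FastAPI and Triton for healthcare NLP inference on Kubernetes?
RQ2: Does the hybrid architecture improve security and performance for regulated healthcare AI deployments?
RQ3: How does Triton's dynamic batching affect p50 and p95 latency and throughput under concurrent load?
RQ4: Which architectural guidelines maximize availability and privacy in clinical AI deployments?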
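For readers unfamiliar with the metrics named in these questions, the following Python sketch shows one common way to derive p50, p95, and throughput from raw per-request latencies. It uses the nearest-rank percentile definition, which is an assumption: the review does not state which percentile method or load-testing tool the paper used, and the helper names `percentile` and `summarize` are illustrative.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms, wall_clock_s):
    """Summarize one benchmark run: median latency, tail latency, requests per second."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "throughput_rps": len(latencies_ms) / wall_clock_s,
    }


if __name__ == "__main__":
    # Synthetic latencies, not data from the paper.
    print(summarize([22, 21, 45, 23, 22], wall_clock_s=0.05))
```

Throughput is computed from wall-clock time for the whole run rather than from the sum of latencies, because concurrent or batched requests overlap in time.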

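Key findings

FastAPI has a lower single-request p50 latency than Triton (22 ms vs 28 ms), attributed to its smaller system overhead.
Triton with dynamic batching (batch size 16) achieved the highest throughput (780 req/s), ahead of non-batched Triton (420 req/s) and the FastAPI baseline (450 req/s).
Under the test conditions, tail (p95) latency was also lower for FastAPI (45 ms) than for Triton (60 ms).
Triton's dynamic batching delivers a substantial throughput gain for a modest increase in p50 latency (to 34 ms).
A hybrid architecture with FastAPI as the secure gateway and Triton as the compute backend is presented as a practical best practice for enterprise healthcare AI deployments.
PHI de-identification during preprocessing reduces the risk of data exposure before requests reach the inference server.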
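The size of the trade-off can be worked out directly from the reported figures. The short Python calculation below uses only numbers quoted in the findings; the one assumption is that the 28 ms single-request p50 for Triton corresponds to the non-batched configuration, which the review implies but does not state outright.

```python
# Figures as reported in the findings above.
fastapi_rps = 450
triton_unbatched_rps = 420
triton_batched_rps = 780
triton_p50_ms = 28          # assumed to be the non-batched configuration
triton_batched_p50_ms = 34

# Relative throughput of batched Triton against each alternative.
gain_vs_fastapi = triton_batched_rps / fastapi_rps
gain_vs_unbatched = triton_batched_rps / triton_unbatched_rps

# Median latency cost of enabling dynamic batching.
p50_increase_ms = triton_batched_p50_ms - triton_p50_ms
p50_increase_pct = 100 * p50_increase_ms / triton_p50_ms

if __name__ == "__main__":
    print(f"throughput vs FastAPI:          {gain_vs_fastapi:.2f}x")
    print(f"throughput vs unbatched Triton: {gain_vs_unbatched:.2f}x")
    print(f"p50 increase with batching:     {p50_increase_ms} ms ({p50_increase_pct:.1f}%)")
```

On these numbers, dynamic batching raises throughput by roughly 1.73x over FastAPI and 1.86x over non-batched Triton, while median latency grows by 6 ms, or about 21%.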

This review was generated by AI and checked by a human editor.