QUICK REVIEW

[Paper Review] Scalable and Secure AI Inference in Healthcare: A Comparative Benchmarking of FastAPI and Triton Inference Server on Kubernetes

Ratul Ali | arXiv (Cornell University) | January 19, 2026

IoT and Edge/Fog Computing | Citations: 0

One-line summary

This paper compares FastAPI and Triton Inference Server for healthcare AI on Kubernetes, and validates a hybrid architecture that combines secure gateway processing with high-throughput backend inference.

ABSTRACT

Efficient and scalable deployment of machine learning (ML) models is a prerequisite for modern production environments, particularly within regulated domains such as healthcare and pharmaceuticals. In these settings, systems must balance competing requirements, including minimizing inference latency for real-time clinical decision support, maximizing throughput for batch processing of medical records, and ensuring strict adherence to data privacy standards such as HIPAA. This paper presents a rigorous benchmarking analysis comparing two prominent deployment paradigms: a lightweight, Python-based REST service using FastAPI, and a specialized, high-performance serving engine, NVIDIA Triton Inference Server. Leveraging a reference architecture for healthcare AI, we deployed a DistilBERT sentiment analysis model on Kubernetes to measure median (p50) and tail (p95) latency, as well as throughput, under controlled experimental conditions. Our results indicate a distinct trade-off. While FastAPI provides lower overhead for single-request workloads with a p50 latency of 22 ms, Triton achieves superior scalability through dynamic batching, delivering a throughput of 780 requests per second on a single NVIDIA T4 GPU, nearly double that of the baseline. Furthermore, we evaluate a hybrid architectural approach that utilizes FastAPI as a secure gateway for protected health information de-identification and Triton for backend inference. This study validates the hybrid model as a best practice for enterprise clinical AI and offers a blueprint for secure, high-availability deployments.

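Motivation and objectives

Promote scalable, compliance-ready AI inference in healthcare and pharmaceuticals.
Evaluate the performance trade-offs between a FastAPI gateway and Triton Inference Server in a Kubernetes setup.
Evaluate a hybrid architecture that uses FastAPI for security and de-identification and Triton for GPU-based inference.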

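Proposed method

Deploys a DistilBERT sentiment analysis model on a Kubernetes reference architecture.
Compares CPU-based FastAPI inference with GPU-based Triton inference under varying load.
Enables dynamic batching in Triton and measures p50 and p95 latency as well as throughput.
Uses a predefined model registry and health checks to support zero-downtime updates.
Evaluates security through a FastAPI gateway that de-identifies PHI during pre-processing and enforces OAuth2/JWT.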
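The gateway-side de-identification step in the method above can be illustrated with a small sketch. The review does not describe how the paper implements de-identification, so everything here is an assumption made for illustration: the function name `deidentify`, the placeholder tokens, and the four regular expressions are hypothetical, and simple pattern matching like this would not by itself satisfy HIPAA de-identification requirements.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical patterns for illustration only; they are not taken from the paper
# and cover just a few obvious identifier formats.
_PATTERNS: List[Tuple[str, Pattern[str]]] = [
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[DATE]", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]


def deidentify(text: str) -> str:
    """Replace each matched identifier with a fixed placeholder token."""
    for placeholder, pattern in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    note = "Patient reachable at 555-123-4567 or jane.doe@example.com, seen 2025-03-14, SSN 123-45-6789."
    print(deidentify(note))
```

In a hybrid deployment of the kind the review describes, a function like this would run inside the gateway's request handler, so that only the redacted text is forwarded to the inference backend.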

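Experimental results

Research questions

RQ1: What are the latency and throughput trade-offs between FastAPI and Triton for healthcare NLP inference on Kubernetes?
RQ2: Does a hybrid architecture improve both security and performance in regulated healthcare AI deployments?
RQ3: How does dynamic batching affect p50 and p95 latency and throughput under concurrent load?
RQ4: Which architectural guidelines maximize availability and privacy in clinical AI deployments?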
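To make the dynamic batching question concrete, here is a toy model of how a batching scheduler might group concurrent requests. It is a simplified sketch, not Triton's actual scheduler: the function name `form_batches` and the parameters `max_batch_size` and `max_delay_ms` are hypothetical, loosely analogous to a maximum batch size and a maximum queue delay. A batch is closed either when it is full or when a new request arrives too long after the first request in the batch.

```python
from typing import List


def form_batches(
    arrivals_ms: List[float], max_batch_size: int, max_delay_ms: float
) -> List[List[float]]:
    """Group request arrival times (in ms) into batches.

    A batch is flushed when it already holds max_batch_size requests, or when
    the next request arrives more than max_delay_ms after the batch's first one.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in sorted(arrivals_ms):
        batch_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and (t - current[0] > max_delay_ms)
        if current and (batch_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Seven requests: a burst of four, a pair, and a straggler.
    print(form_batches([0, 1, 2, 3, 10, 11, 30], max_batch_size=4, max_delay_ms=5))
```

The sketch shows the trade-off the research question targets: requests that wait for a batch to fill add queueing delay to their own latency, while the backend processes fewer, larger batches and can serve more requests per second.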

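Key findings

FastAPI has lower system overhead, so its p50 latency for single-item requests (22 ms) is lower than Triton's (28 ms).
Triton with dynamic batching (batch size 16) delivers the highest throughput (780 req/s), outperforming Triton without batching (420 req/s) and the FastAPI baseline (450 req/s).
Under the tested conditions, tail (p95) latency is lower for FastAPI (45 ms) than for Triton (60 ms).
Triton's dynamic batching raises p50 latency modestly, to 34 ms, but the throughput gain is large.
A hybrid architecture that uses FastAPI as the security gateway and Triton for compute offers a practical best practice for enterprise healthcare AI deployments.
De-identifying PHI during pre-processing reduces the risk of data exposure before the data reaches the inference server.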
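The metrics behind these findings can be reproduced in a few lines. The sketch below assumes per-request latencies are available as a list of millisecond values, which the review does not show, and the helper names `latency_summary` and `speedup` are hypothetical. The throughput figures are the ones reported above.

```python
import statistics
from typing import Dict, List

# Throughput figures as reported in the findings (requests per second).
REPORTED_RPS: Dict[str, float] = {
    "fastapi": 450,
    "triton_no_batching": 420,
    "triton_dynamic_batching": 780,
}


def latency_summary(samples_ms: List[float]) -> Dict[str, float]:
    """Return the median (p50) and tail (p95) of per-request latencies in ms."""
    # quantiles(n=100) yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


def speedup(candidate: str, baseline: str) -> float:
    """Ratio of reported throughput between two configurations."""
    return REPORTED_RPS[candidate] / REPORTED_RPS[baseline]


if __name__ == "__main__":
    # Synthetic latencies (1..100 ms) stand in for real measurements.
    print(latency_summary([float(x) for x in range(1, 101)]))
    print(round(speedup("triton_dynamic_batching", "fastapi"), 2))
    print(round(speedup("triton_dynamic_batching", "triton_no_batching"), 2))
```

With the reported numbers, dynamic batching gives roughly 1.73 times the FastAPI baseline throughput and roughly 1.86 times that of Triton without batching, which is the sense in which the abstract describes it as nearly double.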


This review was generated by AI and reviewed by a human editor.