QUICK REVIEW

[Paper Review] Scalable and Secure AI Inference in Healthcare: A Comparative Benchmarking of FastAPI and Triton Inference Server on Kubernetes

Ratul Ali|arXiv (Cornell University)|Jan 19, 2026

IoT and Edge/Fog Computing | Cited by: 0

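One-Sentence Summary

This paper benchmarks FastAPI against Triton Inference Server on Kubernetes and presents a hybrid architecture for healthcare AI that pairs secure gateway processing with a high-throughput inference backend.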

ABSTRACT

Efficient and scalable deployment of machine learning (ML) models is a prerequisite for modern production environments, particularly within regulated domains such as healthcare and pharmaceuticals. In these settings, systems must balance competing requirements, including minimizing inference latency for real-time clinical decision support, maximizing throughput for batch processing of medical records, and ensuring strict adherence to data privacy standards such as HIPAA. This paper presents a rigorous benchmarking analysis comparing two prominent deployment paradigms: a lightweight, Python-based REST service using FastAPI, and a specialized, high-performance serving engine, NVIDIA Triton Inference Server. Leveraging a reference architecture for healthcare AI, we deployed a DistilBERT sentiment analysis model on Kubernetes to measure median (p50) and tail (p95) latency, as well as throughput, under controlled experimental conditions. Our results indicate a distinct trade-off. While FastAPI provides lower overhead for single-request workloads with a p50 latency of 22 ms, Triton achieves superior scalability through dynamic batching, delivering a throughput of 780 requests per second on a single NVIDIA T4 GPU, nearly double that of the baseline. Furthermore, we evaluate a hybrid architectural approach that utilizes FastAPI as a secure gateway for protected health information de-identification and Triton for backend inference. This study validates the hybrid model as a best practice for enterprise clinical AI and offers a blueprint for secure, high-availability deployments.

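Motivation and Objectives

Advance scalable, compliance-ready AI inference in healthcare and pharmaceuticals.
Evaluate the performance trade-offs between a FastAPI gateway and Triton Inference Server on Kubernetes.
Evaluate a hybrid architecture that uses FastAPI for security and de-identification and Triton for GPU-backed inference.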

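Proposed Method

Deploy a DistilBERT sentiment analysis model within a Kubernetes reference architecture.
Compare CPU-based FastAPI inference with GPU-based Triton inference under varying load.
Enable dynamic batching in Triton and measure p50 latency, p95 latency, and throughput.
Achieve zero-downtime updates using a predefined model registry and health checks.
Evaluate security by enforcing OAuth2/JWT at the FastAPI gateway and de-identifying PHI during preprocessing.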
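
The three metrics named above (p50 latency, p95 latency, throughput) can be computed from raw per-request timings. The following is a minimal sketch, not the paper's benchmarking harness: the function names `percentile`, `summarize`, and `benchmark` are hypothetical, the nearest-rank percentile definition is an assumption, and `send_request` stands in for whatever client call reaches the FastAPI or Triton endpoint.

```python
import math
import time
from typing import Callable, Dict, List


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of an ascending-sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("need at least one sample")
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_ms: List[float], wall_seconds: float) -> Dict[str, float]:
    """Reduce per-request latencies to p50, p95, and requests per second."""
    ordered = sorted(latencies_ms)
    return {
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "throughput_rps": len(ordered) / wall_seconds,
    }


def benchmark(send_request: Callable[[], None], n_requests: int) -> Dict[str, float]:
    """Time n_requests sequential calls to a caller-supplied request function."""
    latencies_ms = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        send_request()  # hypothetical: one inference call against the endpoint
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    wall_seconds = time.perf_counter() - start
    return summarize(latencies_ms, wall_seconds)
```

Because `benchmark` issues requests one at a time, it only approximates the single-request case. Measuring behavior under concurrent load, where dynamic batching matters, would need a concurrent load generator feeding the same `summarize` step.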

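Experimental Results

Research Questions

RQ1: On Kubernetes, what are the latency and throughput trade-offs between FastAPI and Triton for healthcare NLP inference?
RQ2: Can a hybrid architecture improve both the security and the performance of regulated healthcare AI deployments?
RQ3: How does Triton's dynamic batching affect p50 latency, p95 latency, and throughput under concurrent load?
RQ4: Which architectural guidelines maximize availability and privacy in clinical AI deployments?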

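Key Findings

For single requests, FastAPI has a lower p50 latency (22 ms) than Triton (28 ms), owing to its lower system overhead.
Triton with dynamic batching (batch size 16) reaches the highest throughput, 780 req/s, ahead of both unbatched Triton (420 req/s) and the FastAPI baseline (450 req/s).
Under the tested conditions, FastAPI's tail latency (45 ms) is lower than Triton's (60 ms).
Dynamic batching in Triton substantially raises throughput at the cost of only a slight increase in p50 latency (to 34 ms).
A hybrid architecture with FastAPI as the secure gateway and Triton as the compute backend offers a practical best practice for enterprise healthcare AI deployments.
De-identifying PHI during preprocessing reduces the risk of data exposure before requests reach the inference server.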
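
To make the de-identification step concrete, here is a minimal sketch of a preprocessing function that a gateway could run before forwarding text to the inference backend. It is an illustration under stated assumptions, not the paper's implementation: the `PHI_RULES` table, the placeholder tokens, and the `deidentify` function are all hypothetical, and four regular expressions fall far short of what real PHI de-identification under HIPAA requires.

```python
import re

# Hypothetical, deliberately small rule set for illustration only.
# Each entry maps a placeholder token to a pattern for one identifier type.
PHI_RULES = [
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("[DATE]", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]


def deidentify(text: str) -> str:
    """Replace each matched identifier with its placeholder token."""
    for placeholder, pattern in PHI_RULES:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    note = "Patient reachable at 555-123-4567, seen on 2026-01-19."
    print(deidentify(note))
```

Running the redaction in the gateway means the backend only ever receives placeholder tokens in place of the matched identifiers, which is the exposure reduction the finding describes.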

This review was generated by AI and checked by a human editor.