QUICK REVIEW

[Paper Review] Synthetic Data in Healthcare

Daniel McDuff, Theodore Curran | arXiv (Cornell University) | Apr 6, 2023

demographic modeling and climate adaptation | Cited by 16

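One-Sentence Summary

This paper reviews how synthetic data are generated (physical, statistical, and hybrid models), how they are applied in healthcare, what they offer for privacy and fairness, and the risks and challenges they bring.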

ABSTRACT

Synthetic data are becoming a critical tool for building artificially intelligent systems. Simulators provide a way of generating data systematically and at scale. These data can then be used either exclusively, or in conjunction with real data, for training and testing systems. Synthetic data are particularly attractive in cases where the availability of "real" training examples might be a bottleneck. While the volume of data in healthcare is growing exponentially, creating datasets for novel tasks and/or that reflect a diverse set of conditions and causal relationships is not trivial. Furthermore, these data are highly sensitive and often patient specific. Recent research has begun to illustrate the potential for synthetic data in many areas of medicine, but no systematic review of the literature exists. In this paper, we present the cases for physical and statistical simulations for creating data and the proposed applications in healthcare and medicine. We discuss that while synthetics can promote privacy, equity, safety and continual and causal learning, they also run the risk of introducing flaws, blind spots and propagating or exaggerating biases.

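Motivation and Goals

Promote the use of synthetic data in healthcare to address privacy concerns, barriers to data sharing, and data scarcity.
Describe physical, statistical, and hybrid data generation methods and their suitability for medical data.
Discuss sim2real transfer concepts (domain randomization, domain adaptation, differentiable simulation) and their relevance to healthcare tasks.
Highlight the potential benefits (privacy, fairness, safety, continual learning) and risks (bias, flaws, unknowns) of synthetic data in medicine.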

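Proposed Approach

Categorize synthetic data generation into physical models, statistical models, and hybrid methods.
Describe sim2real techniques that reduce the domain gap between synthetic and real data.
Discuss the advantages of differentiable simulation for optimizing simulator fidelity.
Survey applications of synthetic data across modalities (structured electronic health records, natural language, physiological signals, medical imaging).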
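
To make the three generator families above concrete, here is a minimal sketch using a toy pulse-like physiological signal. It is an illustration under stated assumptions, not a method from the paper: the function names, the two-harmonic waveform, the Gaussian heart-rate model, the clamping range, and the example heart rates are all hypothetical.

```python
import math
import random
import statistics


def physical_pulse(heart_rate_bpm, duration_s=5.0, fs=50):
    """Physical-style generator: a waveform computed from an explicit signal model.

    The fundamental-plus-one-harmonic shape is an assumed toy model.
    """
    f = heart_rate_bpm / 60.0  # beats per second
    n = int(duration_s * fs)
    return [
        math.sin(2 * math.pi * f * i / fs) + 0.3 * math.sin(4 * math.pi * f * i / fs)
        for i in range(n)
    ]


def fit_statistical(real_heart_rates):
    """Statistical generator, step 1: fit a Gaussian to observed heart rates."""
    return statistics.mean(real_heart_rates), statistics.stdev(real_heart_rates)


def sample_statistical(model, n, rng):
    """Statistical generator, step 2: draw new heart rates from the fitted model."""
    mu, sigma = model
    # Clamp to an assumed plausible range so the physical model stays meaningful.
    return [min(220.0, max(30.0, rng.gauss(mu, sigma))) for _ in range(n)]


def hybrid_dataset(real_heart_rates, n, rng):
    """Hybrid generator: parameters come from data, signals from the physical model."""
    model = fit_statistical(real_heart_rates)
    rates = sample_statistical(model, n, rng)
    return [(hr, physical_pulse(hr)) for hr in rates]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_rates = [62.0, 71.0, 75.0, 80.0, 68.0]  # made-up values, not real patient data
    for hr, signal in hybrid_dataset(toy_rates, 3, rng):
        print(round(hr, 1), len(signal))
```

In this sketch the hybrid path keeps the interpretability of the physical model while letting observed data decide which parameter values are plausible.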

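Results

Research Questions

RQ1: What are the main methods for generating synthetic healthcare data, and what are their trade-offs?
RQ2: How are synthetic data applied across different healthcare modalities and tasks?
RQ3: What are the benefits and risks of using synthetic data in healthcare, including privacy, fairness, and safety concerns?
RQ4: Which strategies are effective for narrowing the sim2real gap in healthcare applications?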
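
One of the sim2real strategies that RQ4 refers to, domain randomization, can be sketched as follows. The idea is to vary nuisance parameters of the simulator widely for every sample so that a model trained on the output does not latch onto any single rendering of the signal. The simulator, the parameter names, and all ranges below are assumptions chosen for illustration and do not come from the paper.

```python
import math
import random


def simulate_pulse(heart_rate_bpm, amplitude, noise_std, baseline_drift, rng,
                   duration_s=5.0, fs=50):
    """Toy simulator: sinusoidal pulse plus linear drift and Gaussian sensor noise."""
    f = heart_rate_bpm / 60.0
    n = int(duration_s * fs)
    signal = []
    for i in range(n):
        t = i / fs
        clean = amplitude * math.sin(2 * math.pi * f * t)
        drift = baseline_drift * t
        signal.append(clean + drift + rng.gauss(0.0, noise_std))
    return signal


def randomized_dataset(n, rng):
    """Domain randomization: redraw the nuisance parameters for every sample."""
    dataset = []
    for _ in range(n):
        heart_rate = rng.uniform(45.0, 150.0)  # the label a model would predict
        nuisance = {
            "amplitude": rng.uniform(0.5, 2.0),        # e.g. sensor gain
            "noise_std": rng.uniform(0.0, 0.5),        # e.g. sensor noise level
            "baseline_drift": rng.uniform(-0.1, 0.1),  # e.g. slow motion drift
        }
        signal = simulate_pulse(heart_rate, rng=rng, **nuisance)
        dataset.append((signal, heart_rate))
    return dataset


if __name__ == "__main__":
    samples = randomized_dataset(4, random.Random(1))
    for signal, heart_rate in samples:
        print(round(heart_rate, 1), len(signal))
```

Only the label (heart rate) is meant to carry information here; amplitude, noise, and drift are randomized precisely because a real sensor's values for them are unknown in advance.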

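Key Findings

Models trained on synthetic data can match the performance of models trained on real data and, in some cases, outperform real-data-only models when synthetic and real data are combined.
Synthetic data can improve privacy, enable data sharing, and support fairness by generating diverse phenotypes and rare events.
sim2real strategies (domain randomization, domain adaptation, differentiable simulation) help models generalize from synthetic to real data in healthcare settings.
Synthetic data support continual learning and rapid model updates and allow safety testing without harming patients, but they also carry the risk of bias and unrecognized artifacts.
The literature reports successful applications in cardiology, dermatology, medical imaging, ophthalmology, infectious disease, and other areas, using statistical and physical/hybrid simulators.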
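
Differentiable simulation, listed among the sim2real strategies above, can be illustrated with a very small calibration loop: because the simulator output is differentiable in its parameters, gradient descent can pull those parameters toward values that reproduce an observed signal. This is a sketch under assumptions, not the paper's method: the simulator is a toy sinusoid, the gradients are derived by hand as a stand-in for automatic differentiation, and every name and constant is hypothetical.

```python
import math


def simulate(amplitude, baseline, freq_hz=1.2, n=100, fs=50):
    """Toy simulator whose output is differentiable in amplitude and baseline."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / fs) + baseline
            for i in range(n)]


def calibrate(target, lr=0.1, steps=500, freq_hz=1.2, fs=50):
    """Fit simulator parameters to a target signal by gradient descent on the MSE."""
    amplitude, baseline = 1.0, 0.0  # arbitrary starting guess
    n = len(target)
    basis = [math.sin(2 * math.pi * freq_hz * i / fs) for i in range(n)]
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(basis, target):
            err = amplitude * s + baseline - y
            grad_a += 2.0 * err * s / n  # d(MSE)/d(amplitude)
            grad_b += 2.0 * err / n      # d(MSE)/d(baseline)
        amplitude -= lr * grad_a
        baseline -= lr * grad_b
    return amplitude, baseline


if __name__ == "__main__":
    # Stand-in for a real recording: generated here with known parameters.
    observed = simulate(1.7, 0.4)
    print(calibrate(observed))
```

In practice an automatic differentiation framework would supply the gradients for a far richer simulator; the loop structure stays the same.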


This review was generated by AI and checked by a human editor.