[Paper Review] Synthetic Data in Healthcare
The paper surveys how synthetic data are generated (physical, statistical, and hybrid models), their healthcare applications, benefits for privacy and equity, and the risks and challenges they introduce.
Synthetic data are becoming a critical tool for building artificially intelligent systems. Simulators provide a way of generating data systematically and at scale. These data can then be used either exclusively, or in conjunction with real data, for training and testing systems. Synthetic data are particularly attractive in cases where the availability of "real" training examples might be a bottleneck. While the volume of data in healthcare is growing exponentially, creating datasets for novel tasks and/or that reflect a diverse set of conditions and causal relationships is not trivial. Furthermore, these data are highly sensitive and often patient specific. Recent research has begun to illustrate the potential for synthetic data in many areas of medicine, but no systematic review of the literature exists. In this paper, we present the cases for physical and statistical simulations for creating data and the proposed applications in healthcare and medicine. We discuss that while synthetics can promote privacy, equity, safety and continual and causal learning, they also run the risk of introducing flaws, blind spots and propagating or exaggerating biases.
Motivation & Objective
- Motivate the use of synthetic data to address privacy, sharing barriers, and data scarcity in healthcare.
- Characterize physical, statistical, and hybrid data generation methods and their applicability to medical data.
- Discuss sim2real transfer concepts (domain randomization, domain adaptation, differentiable simulation) and their relevance to healthcare tasks.
- Highlight potential benefits (privacy, equity, safety, continual learning) and risks (bias, flaws, unknowns) of synthetic data in medicine.
Proposed method
- Classify synthetic data generation into physical models, statistical models, and hybrid approaches.
- Describe sim2real techniques to reduce domain gaps between synthetic and real data.
- Discuss advantages of differentiable simulation for optimizing simulator fidelity.
- Summarize how synthetic data are used across modalities (structured EHR, natural language, physiological signals, medical imaging).
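To make the domain randomization idea above concrete, here is a minimal sketch, not the paper's method. It assumes a toy "physical" simulator that produces a sinusoidal pulse waveform; the function names (`synth_pulse`, `randomized_dataset`), the parameter ranges, and the sampling rate are all hypothetical choices made for illustration. Each sample is drawn with different simulator parameters so that a downstream model would see wide variation rather than one fixed synthetic "look".

```python
import math
import random


def synth_pulse(heart_rate_bpm, amplitude, noise_std, fs=30.0, seconds=10.0, rng=None):
    """Toy simulator (assumption): sinusoidal pulse waveform plus Gaussian sensor noise."""
    rng = rng or random.Random()
    n = int(fs * seconds)
    freq_hz = heart_rate_bpm / 60.0
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * (i / fs)) + rng.gauss(0.0, noise_std)
        for i in range(n)
    ]


def randomized_dataset(n_samples, seed=0):
    """Domain randomization: draw fresh simulator parameters from broad ranges per sample."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_samples):
        # Hypothetical ranges, chosen only to illustrate the idea.
        params = {
            "heart_rate_bpm": rng.uniform(40.0, 180.0),
            "amplitude": rng.uniform(0.5, 2.0),
            "noise_std": rng.uniform(0.0, 0.5),
        }
        signal = synth_pulse(rng=rng, **params)
        # The label is known exactly because the simulator generated it.
        dataset.append((signal, params["heart_rate_bpm"]))
    return dataset


dataset = randomized_dataset(n_samples=5, seed=42)
```

The design point is that labels come for free from the simulator, while the randomized parameters stand in for the variation a model may encounter in real recordings. A real simulator would need far richer physiology and noise models than this sketch.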
Experimental results
Research questions
- RQ1: What are the main methods for generating synthetic healthcare data and their trade-offs?
- RQ2: How can synthetic data be applied across different medical modalities and tasks?
- RQ3: What are the benefits and risks of using synthetic data in healthcare, including privacy, equity, and safety concerns?
- RQ4: What are effective strategies to bridge sim2real gaps in medical applications?
Key findings
- Models trained on synthetic data can perform comparably to models trained on real data, and in some cases models trained on a combination of synthetic and real data outperform those trained on real data alone.
- Synthetic data can improve privacy, enable data sharing, and support fairness by enabling diverse phenotypes and rare-event generation.
- Sim2real strategies (domain randomization, domain adaptation, differentiable simulation) help generalize models from synthetic to real data in healthcare settings.
- Synthetic data support continual learning and rapid model updates, and can facilitate safety testing without patient risk, but carry risks of bias and unrecognized artifacts.
- The literature shows successful applications in cardiology, dermatology, imaging, ophthalmology, infectious diseases, and more, using both statistical and physical/hybrid simulators.
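As a rough illustration of the statistical generation and rare-event points above, the following sketch fits a one-dimensional Gaussian per class to a handful of values and samples equally from each class. It is a simplified stand-in, not a method described in the paper: the helper names (`fit_gaussian`, `sample_synthetic`) and the toy measurements are hypothetical, and sampling from a fitted distribution does not by itself guarantee privacy.

```python
import random
import statistics


def fit_gaussian(values):
    """Fit a 1-D Gaussian (mean, sample standard deviation) to observed values."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(real_by_class, n_per_class, seed=0):
    """Sample the same number of synthetic values per class, oversampling rare classes."""
    rng = random.Random(seed)
    synthetic = {}
    for label, values in real_by_class.items():
        mu, sigma = fit_gaussian(values)
        synthetic[label] = [rng.gauss(mu, sigma) for _ in range(n_per_class)]
    return synthetic


# Hypothetical toy measurements, not taken from the paper.
real = {
    "common": [70.0, 72.0, 68.0, 75.0, 71.0, 69.0],
    "rare": [110.0, 118.0],
}
balanced = sample_synthetic(real, n_per_class=100, seed=0)
```

Note that the rare class is fitted from only two observations, so any quirk in those two values is amplified across all 100 synthetic samples. This is a small-scale instance of the bias-propagation risk noted in the findings.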
This review was created by AI and reviewed by human editors.