[Paper Review] Modeling Tabular data using Conditional GAN
CTGAN introduces a conditional generator with mode-specific normalization and training-by-sampling to model mixed discrete-continuous tabular data, outperforming Bayesian baselines and several GAN variants on real datasets in most metrics.
Modeling the probability distribution of rows in tabular data and generating realistic synthetic data is a non-trivial task. Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes, whereas discrete columns are sometimes imbalanced, making the modeling difficult. Existing statistical and deep neural network models fail to properly model this type of data. We design CTGAN, which uses a conditional generative adversarial network to address these challenges. To aid in a fair and thorough comparison, we design a benchmark with 7 simulated and 8 real datasets and several Bayesian network baselines. CTGAN outperforms Bayesian methods on most of the real datasets whereas other deep learning methods could not.
Motivation & Objective
- Motivate the challenge of modeling joint distributions in mixed-type tabular data (continuous and discrete) with issues like multimodality and class imbalance.
- Propose CTGAN, a conditional GAN tailored for tabular data to address non-Gaussian continuous distributions and discrete imbalances.
- Introduce training-time techniques (mode-specific normalization, conditional generator, training-by-sampling) to improve fidelity and coverage.
- Provide a benchmarking suite (SDGym) comparing CTGAN to Bayesian networks and other GAN-based methods across simulated and real datasets.
Proposed method
- Mode-specific normalization using a variational Gaussian mixture model to identify and represent multiple modes per continuous column.
- Conditional generator that takes a conditioning vector specifying a discrete attribute value, with a cross-entropy penalty that pushes the generated row to match the requested condition.
- Training-by-sampling strategy to balance exposure to rare discrete values by sampling conditions according to log-frequency in each discrete column.
- Wasserstein GAN with gradient penalty (WGAN-GP) and PacGAN framework to stabilize training and mitigate mode collapse.
- Network design using fully connected layers (tabular data has no local structure to exploit), with batch normalization and ReLU activations in the generator, and leaky ReLU with dropout in the critic.
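Mode-specific normalization from the list above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the mixture weights, means, and standard deviations are hard-coded, hypothetical values standing in for a fitted variational Gaussian mixture model, and the scaling by four standard deviations is an assumption based on the usual description of the technique. Each value is encoded as a scalar `alpha` (its position inside a sampled mode) plus a one-hot vector `beta` naming that mode.

```python
import numpy as np


def mode_specific_normalize(values, weights, means, stds, rng):
    """Encode each scalar as (alpha, beta): a within-mode scalar and a one-hot mode vector."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)

    # Weighted Gaussian density of every value under every mode, shape (n, k).
    z = (values[:, None] - means[None, :]) / stds[None, :]
    dens = weights * np.exp(-0.5 * z**2) / (stds * np.sqrt(2.0 * np.pi))
    probs = dens / dens.sum(axis=1, keepdims=True)

    # Sample one mode per value from its posterior over modes.
    cum = probs.cumsum(axis=1)
    u = rng.random(len(values))
    modes = (u[:, None] > cum).sum(axis=1)
    modes = np.minimum(modes, len(means) - 1)  # guard against rounding in cum

    # Normalize within the chosen mode (the factor 4 is an assumption, see lead-in).
    alpha = (values - means[modes]) / (4.0 * stds[modes])
    beta = np.eye(len(means))[modes]
    return alpha, beta, modes


def denormalize(alpha, modes, means, stds):
    """Invert the encoding given the scalar and the chosen mode index."""
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    return alpha * 4.0 * stds[modes] + means[modes]


# Hypothetical two-mode column: one mode around 0, another around 5.
weights, means, stds = [0.5, 0.5], [0.0, 5.0], [1.0, 0.5]
values = np.array([-5.1, 0.2, 4.9, 5.3])

rng = np.random.default_rng(0)
alpha, beta, modes = mode_specific_normalize(values, weights, means, stds, rng)
restored = denormalize(alpha, modes, means, stds)
```

Compared with min-max scaling over the whole column, each value here is scaled relative to the mode it most plausibly belongs to, so a bimodal column is not squeezed into one range with an empty gap in the middle.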
Experimental results
Research questions
- RQ1: How can tabular data with mixed continuous and discrete features be modeled to capture multimodal continuous distributions and highly imbalanced discrete categories?
- RQ2: Can a conditional GAN tailored for tabular data outperform Bayesian network baselines and existing GAN approaches across diverse datasets?
- RQ3: Do mode-specific normalization and training-by-sampling improve likelihood fidelity and downstream ML performance on synthetic tabular data?
- RQ4: Is a conditional generator capable of producing data conditioned on specific discrete values for data augmentation?
Key findings
| Method | GM Sim. (L_syn) | GM Sim. (L_test) | BN Sim. (L_syn) | BN Sim. (L_test) | Real (clf) | Real (reg) |
|---|---|---|---|---|---|---|
| Identity | -2.61 | -2.61 | -9.33 | -9.36 | 0.743 | 0.14 |
| CLBN | -3.06 | -7.31 | -10.66 | -9.92 | 0.382 | -6.28 |
| PrivBN | -3.38 | -12.42 | -12.97 | -10.90 | 0.225 | -4.49 |
| MedGAN | -7.27 | -60.03 | -11.14 | -12.15 | 0.137 | -8.80 |
| VEEGAN | -10.06 | -4.22 | -15.40 | -13.86 | 0.143 | -6.50e6 |
| TableGAN | -8.24 | -4.12 | -11.84 | -10.47 | 0.162 | -3.09 |
| TVAE | -2.65 | -5.42 | -6.76 | -9.59 | 0.519 | -0.20 |
| CTGAN | -5.72 | -3.40 | -11.67 | -10.60 | 0.469 | -0.43 |
| Real | -9.33 | -9.36 | -9.33 | -9.36 | 0.743 | 0.14 |
- CTGAN outperforms Bayesian networks on most real datasets in the benchmarking study.
- Mode-specific normalization improves modeling of multimodal continuous columns compared to min-max or fixed-GMM setups.
- The conditional generator with training-by-sampling effectively handles imbalanced discrete columns, achieving strong performance on targets like credit datasets.
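- On the real datasets, TVAE (clf 0.519, reg -0.20) and CTGAN (clf 0.469, reg -0.43) score higher than every other synthesizer in the table. TVAE leads on both real-data columns, while CTGAN surpasses TVAE on some simulated-data metrics, such as the Gaussian-mixture L_test column (-3.40 vs. -5.42).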
- The proposed benchmarking suite (SDGym) enables fair comparisons across multiple datasets and evaluation metrics for synthetic tabular data generation.
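The log-frequency sampling of conditions behind the imbalanced-column finding can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's code: the column names and data are made up, `build_samplers` and `sample_condition` are hypothetical helper names, and `log(1 + count)` is used as the probability mass so that a category seen once still gets non-zero weight. A column is picked uniformly, a category is drawn with mass proportional to its log-frequency, and the conditioning vector is the concatenation of one-hot masks over all discrete columns.

```python
import math
import random
from collections import Counter


def build_samplers(columns):
    """Map each discrete column name to (categories, log-frequency probabilities)."""
    samplers = {}
    for name, values in columns.items():
        counts = Counter(values)
        cats = sorted(counts)
        # Assumption: log(1 + count) as the unnormalized mass of each category.
        mass = [math.log(1 + counts[c]) for c in cats]
        total = sum(mass)
        samplers[name] = (cats, [m / total for m in mass])
    return samplers


def sample_condition(samplers, rng):
    """Pick a column uniformly, then a category by log-frequency; return the mask vector."""
    names = sorted(samplers)
    col = rng.choice(names)
    cats, probs = samplers[col]
    value = rng.choices(cats, weights=probs, k=1)[0]

    # Concatenate one mask per column; only the chosen (column, value) entry is 1.
    cond = []
    for name in names:
        name_cats, _ = samplers[name]
        cond.extend(1 if (name == col and c == value) else 0 for c in name_cats)
    return col, value, cond


# Hypothetical data: "default" is heavily imbalanced, "gender" is balanced.
columns = {
    "default": ["no"] * 990 + ["yes"] * 10,
    "gender": ["f"] * 500 + ["m"] * 500,
}
samplers = build_samplers(columns)
col, value, cond = sample_condition(samplers, random.Random(0))
```

In this toy setup the rare category `"yes"` makes up 1% of the rows but is drawn far more often than that once its column is selected, which is the intended effect: the generator and critic see rare values often enough to learn them, without the condition distribution becoming fully uniform.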
This review was created by AI and reviewed by human editors.