QUICK REVIEW

[Paper Review] Modeling Tabular data using Conditional GAN

Lei Xu, Maria Skoularidou|arXiv (Cornell University)|Jul 1, 2019

Generative Adversarial Networks and Image Synthesis | 28 references | 94 citations

TL;DR

CTGAN introduces a conditional generator with mode-specific normalization and training-by-sampling to model mixed discrete-continuous tabular data, outperforming Bayesian baselines and several GAN variants on real datasets in most metrics.

ABSTRACT

Modeling the probability distribution of rows in tabular data and generating realistic synthetic data is a non-trivial task. Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes whereas discrete columns are sometimes imbalanced making the modeling difficult. Existing statistical and deep neural network models fail to properly model this type of data. We design TGAN, which uses a conditional generative adversarial network to address these challenges. To aid in a fair and thorough comparison, we design a benchmark with 7 simulated and 8 real datasets and several Bayesian network baselines. TGAN outperforms Bayesian methods on most of the real datasets whereas other deep learning methods could not.

Motivation & Objective

Address the challenge of modeling the joint distribution of mixed-type tabular data (continuous and discrete columns), where continuous columns can be multimodal and discrete columns imbalanced.
Propose CTGAN, a conditional GAN tailored for tabular data to address non-Gaussian continuous distributions and discrete imbalances.
Introduce training-time techniques (mode-specific normalization, conditional generator, training-by-sampling) to improve fidelity and coverage.
Provide a benchmarking suite (SDGym) comparing CTGAN to Bayesian networks and other GAN-based methods across simulated and real datasets.

Proposed method

Mode-specific normalization using a variational Gaussian mixture model to identify and represent multiple modes per continuous column.
Conditional generator that takes a conditioning vector selecting one value of one discrete column; a cross-entropy penalty in the generator loss pushes the generated row to match that value.
Training-by-sampling strategy to balance exposure to rare discrete values by sampling conditions according to log-frequency in each discrete column.
Wasserstein GAN with gradient penalty (WGAN-GP) and PacGAN framework to stabilize training and mitigate mode collapse.
Network design using fully connected layers (no local structure in tabular data) with batch normalization and ReLU activations in generator and leaky ReLU with dropout in the critic.
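
The training-by-sampling step listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy columns, and the use of log(count + 1) rather than log(count) (so that a category seen once still gets non-zero mass) are assumptions made for this example.

```python
import math
import random
from collections import Counter


def log_frequency_pmf(values):
    """Return {category: probability} with mass proportional to log(count + 1).

    The +1 is an assumption of this sketch; it keeps singletons from
    receiving zero probability.
    """
    counts = Counter(values)
    weights = {cat: math.log(n + 1) for cat, n in counts.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}


def sample_condition(columns, rng):
    """Pick a discrete column uniformly, then a category by log-frequency.

    `columns` maps column name -> list of observed category values.
    Returns (column, category, cond_vector); cond_vector concatenates one
    mask per column and is all zeros except at the chosen category.
    """
    names = sorted(columns)
    chosen_col = rng.choice(names)
    pmf = log_frequency_pmf(columns[chosen_col])
    cats = sorted(pmf)
    chosen_cat = rng.choices(cats, weights=[pmf[c] for c in cats], k=1)[0]
    cond = []
    for name in names:
        for cat in sorted(set(columns[name])):
            cond.append(1 if (name == chosen_col and cat == chosen_cat) else 0)
    return chosen_col, chosen_cat, cond


# Hypothetical toy table with one heavily imbalanced column.
data = {
    "default": ["no"] * 990 + ["yes"] * 10,
    "plan": ["basic"] * 600 + ["plus"] * 300 + ["pro"] * 100,
}

pmf = log_frequency_pmf(data["default"])
rng = random.Random(0)
col, cat, cond = sample_condition(data, rng)
print(pmf)
print(col, cat, cond)
```

In this toy column, "yes" makes up 1% of the rows but receives about 0.26 of the sampling mass, which is how log-frequency sampling gives rare categories more exposure during training without making all categories equally likely.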

Experimental results

Research questions

RQ1: How can tabular data with mixed continuous and discrete features be modeled to capture multimodal continuous distributions and highly imbalanced discrete categories?
RQ2: Can a conditional GAN tailored for tabular data outperform Bayesian network baselines and existing GAN approaches across diverse datasets?
RQ3: Do mode-specific normalization and training-by-sampling improve likelihood fidelity and downstream ML performance on synthetic tabular data?
RQ4: Is a conditional generator capable of producing data conditioned on specific discrete values for data augmentation?

Key findings

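Average benchmark results. GM Sim. and BN Sim. are the Gaussian-mixture and Bayesian-network simulated datasets, scored by the likelihood metrics L_syn and L_test; clf and reg are the average F1 (classification) and R^2 (regression) on the real datasets.

Method	GM Sim. L_syn	GM Sim. L_test	BN Sim. L_syn	BN Sim. L_test	Real clf (F1)	Real reg (R^2)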
Identity	-2.61	-2.61	-9.33	-9.36	0.743	0.14
CLBN	-3.06	-7.31	-10.66	-9.92	0.382	-6.28
PrivBN	-3.38	-12.42	-12.97	-10.90	0.225	-4.49
MedGAN	-7.27	-60.03	-11.14	-12.15	0.137	-8.80
VEEGAN	-10.06	-4.22	-15.40	-13.86	0.143	-6.50e6
TableGAN	-8.24	-4.12	-11.84	-10.47	0.162	-3.09
TVAE	-2.65	-5.42	-6.76	-9.59	0.519	-0.20
CTGAN	-5.72	-3.40	-11.67	-10.60	0.469	-0.43
Real	-9.33	-9.36	-9.33	-9.36	0.743	0.14

CTGAN outperforms Bayesian networks on most real datasets in the benchmarking study.
Mode-specific normalization improves modeling of multimodal continuous columns compared to min-max or fixed-GMM setups.
The conditional generator combined with training-by-sampling helps on imbalanced discrete columns, for example in the credit dataset.
Among the learned models in the table, TVAE and CTGAN score highest on the real datasets (clf 0.519 and 0.469; reg -0.20 and -0.43). TVAE leads on those averages, while CTGAN has the best L_test on the Gaussian-mixture simulations (-3.40).
The proposed benchmarking suite (SDGym) enables fair comparisons across multiple datasets and evaluation metrics for synthetic tabular data generation.
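
To make the mode-specific normalization finding more concrete, the sketch below encodes a continuous value as a scalar plus a one-hot mode indicator. It assumes the mixture has already been fitted: the two modes in `MODES` are hypothetical hard-coded values, whereas the paper obtains them from a variational Gaussian mixture model. The scaling by four standard deviations follows the normalization described in the CTGAN paper; the function names and the nearest-mode fallback are choices made for this example.

```python
import math
import random

# Hypothetical, already-fitted mixture for one continuous column:
# (weight, mean, std) per mode. Hard-coded here for illustration only.
MODES = [(0.6, 10.0, 1.0), (0.4, 50.0, 5.0)]


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mode_specific_encode(x, modes, rng):
    """Encode scalar x as (alpha, beta).

    beta is a one-hot list marking a mode sampled in proportion to
    weight * density at x; alpha is x normalized within that mode,
    (x - mean) / (4 * std).
    """
    dens = [w * gaussian_pdf(x, m, s) for w, m, s in modes]
    total = sum(dens)
    if total == 0.0:
        # x is so far from every mode that all densities underflow:
        # fall back to the mode with the nearest mean (sketch-only choice).
        k = min(range(len(modes)), key=lambda i: abs(x - modes[i][1]))
    else:
        k = rng.choices(range(len(modes)), weights=dens, k=1)[0]
    _, mean, std = modes[k]
    alpha = (x - mean) / (4.0 * std)
    beta = [1 if i == k else 0 for i in range(len(modes))]
    return alpha, beta


rng = random.Random(0)
alpha_low, beta_low = mode_specific_encode(11.0, MODES, rng)
alpha_high, beta_high = mode_specific_encode(52.0, MODES, rng)
print(alpha_low, beta_low)
print(alpha_high, beta_high)
```

Here 11.0 is expressed relative to the mode centred at 10 and 52.0 relative to the mode centred at 50, so each value lands in a small range around zero together with an explicit mode label. A single min-max scaling over the whole column would instead squeeze both modes into one interval and lose that structure.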

This review was created by AI and reviewed by human editors.