This is the official implementation of the paper
"Generating High-Diversity Synthetic Tabular Data via a Less-Constrained Prior" (IJCAI-ECAI 2026).

We recommend using a virtual environment.
Install the required dependencies using:

    pip install -r requirements.txt

This codebase uses Weights & Biases (wandb) for experiment tracking.
- Create a wandb account at https://wandb.ai
- Log in from the terminal:

      wandb login

- In `main.py` or `inference.py`, set your wandb information at the top of the file:

      project = "YOUR_PROJECT_NAME"
      entity = "YOUR_WANDB_USERNAME"

This repository provides implementations of our proposed method (SPT) as well as baseline methods such as CTGAN, TVAE, TabMT, TabDDPM, STaSy, TabSyn, and AutoDiff.
The overall workflow is as follows:
- Prepare a wandb account and configure it in the code
- Move to the directory of the desired model (e.g., `SPT` or `baseline/TabSyn`)
- Train the model using `main.py`
- Generate synthetic samples using `inference.py`
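
The workflow above can also be scripted. The following is a minimal sketch and not part of the repository: `plan_workflow` and `run_workflow` are hypothetical helpers, and the sketch assumes that `main.py` and `inference.py` sit in the model directory and are run with the current Python interpreter. The wandb configuration still has to be set by hand in the scripts.

```python
import subprocess
import sys
from pathlib import Path


def plan_workflow(model_dir, train_args, inference_args):
    """Return (working directory, command) pairs: training first, then inference."""
    cwd = Path(model_dir)
    return [
        (cwd, [sys.executable, "main.py", *train_args]),
        (cwd, [sys.executable, "inference.py", *inference_args]),
    ]


def run_workflow(steps, dry_run=True):
    """Run each step in order; with dry_run=True the commands are only printed."""
    for cwd, cmd in steps:
        print(f"(in {cwd}) {' '.join(cmd)}")
        if not dry_run:
            # check=True stops the workflow if a step exits with an error.
            subprocess.run(cmd, cwd=cwd, check=True)


# Hyperparameters copied from the SPT example command in this README.
hparams = [
    "--dataset", "redwine", "--var", "0.1", "--d_token", "4",
    "--denoising_dim", "1024", "--lr1", "0.001",
    "--batch_size1", "512", "--batch_size2", "512", "--max_beta", "0.01",
]

steps = plan_workflow(
    "SPT",
    train_args=hparams + ["--seed", "0"],
    inference_args=hparams + ["--ver", "0"],
)
run_workflow(steps, dry_run=True)
```

Keeping `dry_run=True` lets you inspect both commands before starting a training run.
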
First, navigate to the directory corresponding to the baseline or method you wish to execute. For example:

    cd SPT
    # or
    cd baseline/TabSyn
    # or
    cd baseline/TabDDPM

To train a model, run the following command:

    python main.py --dataset [NAME_OF_DATASET] [ADDITIONAL_ARGS]

Here, `[ADDITIONAL_ARGS]` corresponds to model-specific hyperparameters, which are summarized below.

| Model | Required Arguments |
|---|---|
| SPT | dataset, var, d_token, denoising_dim, lr1, batch_size1, batch_size2, max_beta |
| TabSyn | dataset, d_token, denoising_dim, lr1, batch_size1, batch_size2, max_beta |
| TabDDPM | dataset, weight_decay, lr, dim_embed, batch_size, num_layers |
| CTGAN | dataset, latent_dim, batch_size, epochs, generator_dim, discriminator_dim |
| TVAE | dataset, latent_dim, batch_size, epochs, loss_factor |
| TabMT | dataset, dim_transformer, num_transformer_heads, num_transformer_layer, batch_size, epochs, max_clusters |
| STaSy | dataset, model.sigma_min, model.sigma_max, optim.lr, model.beta0, model.alpha0 |
| AutoDiff | dataset, d_token, denoising_dim, lr1, batch_size1, batch_size2 |
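
The table above can be turned into a small pre-flight check that catches a missing hyperparameter before a run starts. This is a sketch under stated assumptions: `REQUIRED_ARGS` and `build_train_command` are hypothetical names that do not exist in the repository, and the `--name value` flag style is taken from the SPT example command. Whether every baseline (for instance STaSy, with its dotted argument names) accepts the same style is an assumption.

```python
# Required arguments per model, copied from the table above.
REQUIRED_ARGS = {
    "SPT": ["dataset", "var", "d_token", "denoising_dim", "lr1",
            "batch_size1", "batch_size2", "max_beta"],
    "TabSyn": ["dataset", "d_token", "denoising_dim", "lr1",
               "batch_size1", "batch_size2", "max_beta"],
    "TabDDPM": ["dataset", "weight_decay", "lr", "dim_embed",
                "batch_size", "num_layers"],
    "CTGAN": ["dataset", "latent_dim", "batch_size", "epochs",
              "generator_dim", "discriminator_dim"],
    "TVAE": ["dataset", "latent_dim", "batch_size", "epochs", "loss_factor"],
    "TabMT": ["dataset", "dim_transformer", "num_transformer_heads",
              "num_transformer_layer", "batch_size", "epochs", "max_clusters"],
    "STaSy": ["dataset", "model.sigma_min", "model.sigma_max", "optim.lr",
              "model.beta0", "model.alpha0"],
    "AutoDiff": ["dataset", "d_token", "denoising_dim", "lr1",
                 "batch_size1", "batch_size2"],
}


def build_train_command(model, values):
    """Build the `python main.py ...` argument list, failing on missing arguments."""
    if model not in REQUIRED_ARGS:
        raise ValueError(f"Unknown model: {model}")
    missing = [name for name in REQUIRED_ARGS[model] if name not in values]
    if missing:
        raise ValueError(f"{model} is missing required arguments: {', '.join(missing)}")
    cmd = ["python", "main.py"]
    for name in REQUIRED_ARGS[model]:
        # Assumed flag style: --name value
        cmd += [f"--{name}", str(values[name])]
    return cmd


spt_values = {
    "dataset": "redwine", "var": 0.1, "d_token": 4, "denoising_dim": 1024,
    "lr1": 0.001, "batch_size1": 512, "batch_size2": 512, "max_beta": 0.01,
}
cmd = build_train_command("SPT", spt_values)
print(" ".join(cmd))
```

Run the resulting command from inside the model's directory.
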

- Supported datasets for `--dataset` are:

      abalone, adult, anuran, banknote, breast, concrete,
      kings, letter, loan, redwine, shoppers, whitewine

- To ensure reproducibility, you can fix the random seed during training by adding the `--seed` argument.
- Below is an example command for training SPT on the redwine dataset with a fixed seed:

      python main.py --dataset redwine --var 0.1 --d_token 4 --denoising_dim 1024 \
        --lr1 0.001 --batch_size1 512 --batch_size2 512 --max_beta 0.01 --seed 0

After training, synthetic samples can be generated using the following command:

    python inference.py --dataset [NAME_OF_DATASET] [ADDITIONAL_ARGS]

Here, `[ADDITIONAL_ARGS]` must be identical to those used during training, as they are required to correctly load the trained model checkpoint.

- The same set of hyperparameters used in the training phase must be provided during inference.
- During inference, the argument `--ver` is used to specify the experiment version stored in wandb. `--ver` controls which trained checkpoint is loaded and does not need to match `seed`.
- Below is an example command for generating synthetic data using SPT trained on the redwine dataset:

      python inference.py --dataset redwine --var 0.1 --d_token 4 --denoising_dim 1024 \
        --lr1 0.001 --batch_size1 512 --batch_size2 512 --max_beta 0.01 --ver 0

This command loads the trained SPT model corresponding to version `ver = 0` from wandb and generates synthetic samples for the redwine dataset.

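
Because the inference arguments must match the training arguments, one way to avoid mismatches is to derive them from the training command instead of retyping them. The helper below is a hypothetical sketch and not part of the repository. It mirrors the two example commands in this README by dropping `--seed` and appending `--ver`, and it assumes the seed is passed as two tokens (`--seed 0`) rather than as `--seed=0`.

```python
def inference_args_from_training(train_args, ver):
    """Copy the training arguments, drop `--seed <value>`, and append `--ver <ver>`."""
    result = []
    skip_next = False
    for token in train_args:
        if skip_next:
            # This token is the value that followed --seed.
            skip_next = False
            continue
        if token == "--seed":
            skip_next = True
            continue
        result.append(token)
    return result + ["--ver", str(ver)]


train_args = [
    "--dataset", "redwine", "--var", "0.1", "--d_token", "4",
    "--denoising_dim", "1024", "--lr1", "0.001",
    "--batch_size1", "512", "--batch_size2", "512",
    "--max_beta", "0.01", "--seed", "0",
]
infer_args = inference_args_from_training(train_args, ver=0)
print("python inference.py " + " ".join(infer_args))
```
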
We provide a comprehensive example to ensure easy reproducibility of our results.
- File Location: `SPT/example.ipynb`
- Content: This notebook contains the entire pipeline for the anuran dataset, including:
  - Training: Step-by-step model training using the SPT method.
  - Inference: Generating synthetic tabular data from the trained model.
  - Evaluation: Measuring the quality of synthetic data using all metrics (GoF, MMD, PCD, Coverage, etc.).

You can run this notebook to verify the performance and understand the workflow of our proposed method.