omaralvarez/gentab

GenTab

Synthetic Tabular Data Generation Library

Overview

This Python library specializes in generating synthetic tabular data. It offers a diverse range of statistical, Machine Learning (ML) and Deep Learning (DL) methods that capture the patterns in real datasets and replicate them in synthetic ones. Its applications include pre-processing tabular datasets, data balancing and resampling.

Features

🔩 Pre-process your data.

🕜 State-of-the-art models.

♻️ Easy to use and customize.

Install

The gentab library can be installed with pip. We recommend using a virtual environment to avoid conflicts with other software on your machine.

pip install gentab

Available Generators

Below is the list of the generators currently available in the library.

Linear

Model Example Paper
SMOTE Open In Colab link
ADASYN Open In Colab link

PDF

Model Example Paper
Gaussian Copula Open In Colab link

AE

Model Example Paper
TVAE Open In Colab link

GAN

Model Example Paper
CTGAN Open In Colab link
CTAB-GAN Open In Colab link
CTAB-GAN+ Open In Colab link

Diffusion

Model Example Paper
ForestDiffusion Open In Colab link

LLM

Model Example Paper
GReaT Open In Colab link
Tabula Open In Colab link

Hybrid

Model Example Papers
Copula GAN Open In Colab link link
AutoDiffusion Open In Colab link
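
To give a feel for the simplest family listed above, here is a minimal sketch of the interpolation idea behind SMOTE: pick a minority-class row, pick one of its nearest minority neighbours, and create a new row somewhere on the segment between them. This is an illustration only, not gentab's implementation; the function `smote_sketch`, its parameters and the toy data are all hypothetical, and it handles numeric features only.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority rows.
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbour = rng.choice(neighbours)
        # A random point on the segment between `base` and `neighbour`.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class rows with two numeric features each.
minority_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_sketch(minority_rows, n_new=5)
```

Because every synthetic row is a convex combination of two real rows, the new samples never leave the region already covered by the minority class.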

Examples

Generation

from gentab.generators import AutoDiffusion
from gentab.evaluators import MLP
from gentab.data import Config, Dataset
from gentab.utils import console

config = Config("configs/playnet.json")

dataset = Dataset(config)
dataset.reduce_size(
    {
        "left_attack": 0.97,
        "right_attack": 0.97,
        "right_transition": 0.9,
        "left_transition": 0.9,
        "time_out": 0.8,
        "left_penal": 0.5,
        "right_penal": 0.5,
    }
)
dataset.merge_classes(
    {
        "attack": ["left_attack", "right_attack"],
        "transition": ["left_transition", "right_transition"],
        "penalty": ["left_penal", "right_penal"],
    }
)
dataset.reduce_mem()

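# Report class and row counts before generation.
console.print(dataset.class_counts(), dataset.row_count())
generator = AutoDiffusion(dataset)
generator.generate()
# Report class and row counts of the generated data.
console.print(dataset.generated_class_counts(), dataset.generated_row_count())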

evaluator = MLP(generator)
evaluator.evaluate()

dataset.save_to_disk(generator)
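
The mapping passed to merge_classes above reads as "new class name: list of original labels to fold into it". The sketch below shows that relabelling on a plain list of labels. It is not gentab's code: `merge_labels` is a hypothetical helper, and it assumes that labels not mentioned in the mapping are kept unchanged.

```python
from collections import Counter


def merge_labels(labels, merge_map):
    """Relabel rows: any label listed under a merged class takes that class's name."""
    # Invert {"attack": ["left_attack", "right_attack"]} into
    # {"left_attack": "attack", "right_attack": "attack"}.
    old_to_new = {old: new for new, olds in merge_map.items() for old in olds}
    # Labels that are not mentioned in the map are left as they are (assumption).
    return [old_to_new.get(label, label) for label in labels]


# Toy label column, not taken from the real dataset.
labels = ["left_attack", "right_attack", "time_out", "left_penal", "right_attack"]
merged = merge_labels(
    labels,
    {
        "attack": ["left_attack", "right_attack"],
        "penalty": ["left_penal", "right_penal"],
    },
)
counts = Counter(merged)
```

Counting the merged labels afterwards is a quick way to check how much the merge reduced the class imbalance.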

Tuning

from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

config = Config("configs/adult.json")

dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()

generator = AutoDiffusion(dataset)

evaluator = LightGBM(generator)

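trials = 10  # number of tuning trials
timeout_seconds = 60 * 60 * 8  # 8 hours in seconds; the name avoids shadowing the stdlib `time` module
tuner = AutoDiffusionTuner(evaluator, trials, timeout=timeout_seconds)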
tuner.tune()
tuner.save_to_disk()
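
The tuner above is given two budgets: a trial count and a timeout. The source does not say how they interact, so the sketch below assumes the common convention that the search stops as soon as either budget runs out. It uses plain random search with a toy objective; `budgeted_search`, `toy_objective` and the `lr` parameter are hypothetical names for illustration and are not part of gentab.

```python
import random
import time


def budgeted_search(objective, sample_params, n_trials, timeout_s, seed=0):
    """Random search that stops after n_trials or timeout_s seconds, whichever comes first."""
    rng = random.Random(seed)
    deadline = time.monotonic() + timeout_s
    best_params, best_score = None, float("-inf")
    trials_run = 0
    while trials_run < n_trials and time.monotonic() < deadline:
        params = sample_params(rng)
        score = objective(params)
        trials_run += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, trials_run


def toy_objective(params):
    # Stand-in for "train a generator, then score it with an evaluator".
    return -((params["lr"] - 0.01) ** 2)


best, score, used = budgeted_search(
    toy_objective,
    lambda rng: {"lr": rng.uniform(0.0001, 0.1)},
    n_trials=10,
    timeout_s=60 * 60 * 8,
)
```

With a cheap objective the trial count is the binding limit; with an expensive generator the timeout is more likely to end the search first.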

Loading Stored Synthetic Datasets

from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

config = Config("configs/adult.json")

dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()

# Load previously saved dataset...
generator = AutoDiffusion(dataset)
generator.load_from_disk()

# Do work with previously generated but not tuned dataset...
evaluator = LightGBM(generator)
evaluator.evaluate()
evaluator.evaluate_baseline()

# Load previously tuned and saved dataset...
tuner = AutoDiffusionTuner(evaluator, 0)
tuner.load_from_disk()

# Do work with previously tuned dataset...
evaluator.evaluate()
evaluator.evaluate_baseline()

📜 Citation

@Article{mures2025mitigating,
  author = {Mures, Omar A. and Taibo, Javier and Padr{\'o}n, Emilio J. and Iglesias-Guitian, Jose A.},
  title = {Mitigating Class Imbalance in Tabular Data through Neural Network-based Synthetic Data Generation: A Comprehensive Survey and Library},
  year = {2025}
}

Acknowledgements

This project has received support from the Spanish Ministry of Science and Innovation (AEI/PID2020-115734RB-C22 and AEI/RYC2018-025385-I), Xunta de Galicia (ED431F 2021/11) and EU-FEDER Galicia (ED431G 2019/01).
