ML-FIGS-LDM is a Latent Diffusion Model (LDM) for generating educational figures. Its AutoencoderKL is trained with a Text Perceptual Loss so that text within the figures is reconstructed more legibly.
Dataset ML-Figs
We present the ML-Figs dataset, a collection of 4,302 figures and captions extracted from 43 machine learning books, designed to advance research in understanding and interpreting educational materials. The split is 4,000 samples for training and 302 for testing.
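For illustration, a figure-caption dataset of this kind can be wrapped as a PyTorch Dataset roughly as follows. The file layout assumed here (an images/ directory plus a captions.json file mapping image names to captions) is hypothetical and not necessarily the actual ML-Figs format.

import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class FigureCaptionDataset(Dataset):
    """Minimal figure-caption dataset sketch (hypothetical file layout)."""

    def __init__(self, root, transform=None):
        self.root = Path(root)
        # Assumed layout: captions.json maps "fig_0001.png" -> "caption text"
        with open(self.root / "captions.json") as f:
            self.captions = json.load(f)
        self.files = sorted(self.captions)
        self.transform = transform

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        name = self.files[idx]
        image = Image.open(self.root / "images" / name).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return {"image": image, "caption": self.captions[name]}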
Expanded Dataset ML-Figs-SciCap
To improve the coverage and diversity of our data, we expand the ML-Figs dataset with additional figures and captions from the SciCap dataset, in particular those from ACL papers. The combined ML-Figs + SciCap dataset contains 19,514 samples.
Text Perceptual Loss (TPL)
The Text Perceptual Loss measures the perceptual similarity between the text regions of two images. Text bounding boxes are extracted, a mean squared error (MSE) loss is computed for each corresponding text region, and the final loss is the average of these per-region losses.
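A minimal sketch of such a loss in PyTorch is shown below. The class name and interface are hypothetical, and it assumes the text bounding boxes are supplied externally (e.g., by an OCR text detector) as (x1, y1, x2, y2) pixel coordinates; it is not the repository's actual implementation.

import torch
import torch.nn as nn


class TextPerceptualLoss(nn.Module):
    """Average per-region MSE between text regions of two images (sketch)."""

    def forward(self, recon, target, boxes):
        # recon, target: (C, H, W) tensors for a single image pair.
        # boxes: list of (x1, y1, x2, y2) pixel coordinates (assumed given).
        region_losses = []
        for x1, y1, x2, y2 in boxes:
            rec_crop = recon[:, y1:y2, x1:x2]
            tgt_crop = target[:, y1:y2, x1:x2]
            # MSE over the cropped text region.
            region_losses.append(torch.mean((rec_crop - tgt_crop) ** 2))
        if not region_losses:
            # No text regions detected: contribute zero loss.
            return recon.new_zeros(())
        # Final loss is the mean of the per-region losses.
        return torch.stack(region_losses).mean()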
Install the requirements:
pip install -r requirements.txt
or create a conda environment:
conda env create -f environment.yaml
conda activate ml-figs-ldm
pip install -e .
Update albumentations package:
python scripts/update_albm_package.py
Train the LDM:
python main.py --base configs/ml-figs-scicap-ldm.yaml --train=True --scale_lr=False
Train the autoencoder:
python main.py --base configs/ml-figs-scicap-vae.yaml --train=True
Evaluate the LDM:
python scripts/eval_ldm.py
Compute FID for the LDM:
python scripts/eval_FID_ldm.py
Evaluate the autoencoder:
python scripts/eval_vae.py
Model A is trained on ML-Figs and Model B on ML-Figs + SciCap. TPL: Text Perceptual Loss. SD refers to Stable Diffusion v1-4 trained on LAION.
Autoencoder and LDM models are available for download at huggingface.co/salamnocap/ml-figs-ldm. The models are trained on the ML-Figs + SciCap dataset.
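As a minimal sketch, the checkpoints can be fetched with the huggingface_hub client. Since the exact file names inside the repository are not listed here, the example simply downloads the full repository snapshot.

from huggingface_hub import snapshot_download

# Download all files from the model repository to a local cache directory.
local_dir = snapshot_download(repo_id="salamnocap/ml-figs-ldm")
print(f"Checkpoints downloaded to: {local_dir}")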