📚 [Paper] | 🤗 [Hugging Face]
We provide the full list of dependencies required to run and reproduce our experiments in the requirements.txt file, which can be installed into any Python environment via pip:
pip install -r requirements.txt
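For instance, one common setup is to install them into a fresh virtual environment (the environment name below is arbitrary):
python3 -m venv l2d_env
source l2d_env/bin/activate
pip install -r requirements.txt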
In the cfgs/ folder, we provide the full list of configurations and hyper-parameters used in our work to train and evaluate L2D. In particular, the cfgs/model/ subfolder contains model-specific configurations named as follows:
{base_model}_lad.cfg for L2D full diffusion path finetuning.
{base_model}_lad_lora.cfg for L2D diffusion path finetuning with LoRA.
For instance: llama_3.1_8b_instruct_lad_lora.cfg.
However, you can train and evaluate any existing local model, or any model hosted on Huggingface, by simply modifying:
pretrained_model_dir = "my/model/name/or/path"
tokenizer_dir = "my/model/name/or/path"
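For instance, to point L2D at a different hosted base model (the model id below is purely illustrative), both entries would typically reference the same repository:
pretrained_model_dir = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer_dir = "meta-llama/Llama-3.2-1B-Instruct"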
While we make use of distributed training and evaluation setups with the deepspeed library, our experiments should be reproducible even with small computation budgets and a single GPU by adjusting the micro_batch_size parameters. In the scripts/ folder, we provide further scripts to facilitate running experiments with our repository.
By default, checkpoints and results are saved in the experiments folder.
Please use the scripts/run_training.sh script, passing the available GPUs as the first argument (e.g., 0, 0,1, 0,1,2,3, etc.) and the path to the relevant config file as the second argument (e.g., llama_3.2_1b_instruct_lad_lora.cfg):
scripts/run_training.sh 0,1 cfgs/model/llama_3.2_1b_instruct_lad_lora.cfg
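For example, the same run on a single GPU (optionally lowering the micro_batch_size values in the config file to fit memory, as noted above) would be launched as:
scripts/run_training.sh 0 cfgs/model/llama_3.2_1b_instruct_lad_lora.cfg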
By default, this training phase uses a subset of the Smoltalk dataset. However, it can be easily extended to any custom dataset by writing another training task following the example structure in tasks/smoltalk.py, as sketched below.
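As a rough illustration, a custom task could follow the same high-level recipe of loading a dataset and converting it into chat-style messages. The dataset id, function, and field names below are purely illustrative assumptions; the exact interface expected by the trainer is defined by the existing task files such as tasks/smoltalk.py:
from datasets import load_dataset

def load_custom_task(split="train"):
    # The dataset id below is a placeholder; any local or hosted dataset
    # with an instruction/response structure can be substituted.
    raw = load_dataset("my_org/my_instruction_dataset", split=split)

    def to_messages(example):
        # Convert each raw example into a list of chat messages, mirroring
        # the conversational format of Smoltalk (field names are illustrative).
        return {
            "messages": [
                {"role": "user", "content": example["prompt"]},
                {"role": "assistant", "content": example["response"]},
            ]
        }

    return raw.map(to_messages, remove_columns=raw.column_names)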
Please use the scripts/run_bench_full.sh script, passing the available GPUs as the first argument (e.g., 0, 0,1, 0,1,2,3, etc.), the path to the relevant config file as the second argument (e.g., cfgs/model/llama_3.2_1b_lad_lora.cfg), and the path to the PyTorch checkpoint file saved after training as the third argument:
scripts/run_bench_full.sh 0,1 cfgs/model/llama_3.2_1b_lad_lora.cfg $CHECKPOINT_PATH
In our experiments, we made use of the lighteval/MATH dataset for our results on the MATH task. Since this dataset has been temporarily removed from Huggingface, our default configuration files omit this setting. Please add an equivalent local or hosted dataset back to cfgs/benchmark.cfg to reactivate MATH evaluation.
Running experiments requires downloading models and datasets hosted on Huggingface. Hence, you will need to log into a Huggingface account with an access token, as explained here, using the following command:
huggingface-cli login
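Alternatively, you can authenticate from within Python using the huggingface_hub library:
from huggingface_hub import login
# Prompts for a Huggingface access token and caches it locally,
# achieving the same effect as the huggingface-cli login command above.
login()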
The default logging functionality saves results locally via TensorBoard. Weights & Biases logging is also supported; to enable it, add the following to the provided configuration files:
save_wandb = True
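Note that, if you have not previously authenticated with Weights & Biases on your machine, you will likely also need to log in with the standard wandb CLI before launching a run:
wandb login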
To cite our work, you can use the following:
@article{sakana2025l2d,
  title={Large Language Models to Diffusion Finetuning},
  author={Cetin, Edoardo and Zhao, Tianyu and Tang, Yujin},
  journal={arXiv preprint arXiv:2501.15781},
  year={2025}
}