
v0.6.0

@pbontrager released this 24 Mar 21:52

Highlights

We are releasing torchtune v0.6.0 with exciting new features and improved distributed training support! This release includes tensor parallel (TP) + FSDP training, TP inference, multinode training, and a full distributed DPO recipe. We also landed Phi 4, logging with MLflow, and improved support for Ascend NPUs.
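Since MLflow logging doesn't get its own section below, here is a minimal sketch of using the new logger standalone. This assumes the logger is exposed as MLFlowLogger in torchtune's metric_logging module with the standard metric-logger interface (log, log_dict, close); the experiment name and metric values are illustrative only.

# A minimal sketch, assuming an MLFlowLogger component in
# torchtune.training.metric_logging with the standard logger interface.
from torchtune.training.metric_logging import MLFlowLogger

logger = MLFlowLogger(experiment_name="my_finetune")     # illustrative name
logger.log("loss", 1.23, step=0)                         # log a single scalar
logger.log_dict({"lr": 2e-5, "grad_norm": 0.8}, step=0)  # log several at once
logger.close()                                           # end the MLflow run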

Tensor Parallel training + inference (#2245) (#2330)

Tensor parallelism is a model parallelism technique for distributed training. When combined with FSDP, TP enables more efficient training of large models across many GPUs than FSDP alone. While FSDP shards model states and splits your data across GPUs, TP splits each model layer itself across GPUs, allowing model layers to be computed much faster at larger scales. In addition to training, we've also enabled TP inference, which is crucial for generating text or doing reinforcement learning when your model doesn't fit on a single GPU. To learn more about how to define a TP model, take a look here.
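To give a flavor of what a TP plan looks like, here is a minimal sketch using PyTorch's core tensor parallel API, which torchtune builds on. The two-layer MLP, the 4-way mesh, and the module names are illustrative, not torchtune's actual plan.

# A minimal TP sketch; run under torchrun, e.g. torchrun --nproc_per_node 4 tp_sketch.py
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

tp_mesh = init_device_mesh("cuda", (4,))  # one mesh dimension for TP

class FeedForward(nn.Module):
    def __init__(self, dim=1024, hidden=4096):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

model = FeedForward().cuda()

# Shard w1 column-wise and w2 row-wise: the intermediate activation stays
# sharded across GPUs and only one all-reduce runs per forward pass.
parallelize_module(model, tp_mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})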

Multinode training support (#2301)

Multinode finetuning is now supported, allowing you to train larger models faster. Using SLURM, you can launch tune run across multiple nodes and train just as you would on a single machine. We include an example SLURM script and a getting-started tutorial here.
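For a concrete flavor, here is a hedged sketch of a SLURM batch script, assuming a 2-node allocation with 8 GPUs per node; the recipe and config names are placeholders, and the shipped example script is the authoritative reference.

#!/bin/bash
#SBATCH --job-name=torchtune-multinode
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Use the first node in the allocation as the rendezvous host
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# tune run forwards torchrun arguments, so standard rendezvous flags apply
srun tune run --nnodes 2 --nproc_per_node 8 \
    --rdzv_backend c10d --rdzv_endpoint "${head_node}:29500" \
    full_finetune_distributed --config llama3_1/70B_full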

Full Distributed DPO recipe (#2275)

We've had DPO support for some time, but you can now train DPO with all of the distributed features described above. This extends our recipe coverage to the growing number of 70B+ models. To finetune Llama 3.1 8B with Full Distributed DPO, you can run:

# Download Llama 3.1 8B
tune download meta-llama/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth"

# Finetune on four devices
tune run --nnodes 1 --nproc_per_node 4 full_dpo_distributed --config llama3_1/8B_full_dpo

A special thanks to @sam-pi for adding this recipe.

Phi 4 models (#1835)

We now support Phi 4! This includes the 14B model for now, with recipes for full, LoRA, and QLoRA finetuning on one or more devices. For example, you can full finetune Phi 4 14B on a single GPU by running:

# Download Phi 4 14B
tune download microsoft/phi-4

# Install bitsandbytes (used by the low-memory config)
pip install bitsandbytes

# Finetune on a single GPU
tune run full_finetune_single_device --config phi4/14B_full_low_memory

A huge thanks to @krammnic for landing these models!

Improved NPU support (#2234)

We are continuing to improve our support for Ascend NPU devices. This release includes fixes and enhancements to give you better performance with the NPU backend. Thank you to @Nicorgi for the help!
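If you're on an Ascend machine, here is a hedged example of selecting the backend: torchtune configs accept key=value overrides on the command line, so pointing the device at npu should look like the following (the config name is just an example).

# Run a single-device recipe on an Ascend NPU via a config override
tune run full_finetune_single_device --config llama3_2/1B_full_single_device device=npu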

What's Changed

New Contributors

Full Changelog: v0.5.0...v0.6.0