v0.6.0
Highlights
We are releasing torchtune v0.6.0 with exciting new features and improved distributed training support! This release includes Tensor Parallel (TP) + FSDP training, TP inference, multinode training, and a full distributed DPO recipe. We also landed Phi 4, logging with MLflow, and improved support for NPUs.
Tensor Parallel training + inference (#2245) (#2330)
Tensor parallelism (TP) is a model parallelism technique for distributed training. When combined with FSDP, TP enables more efficient training of large models across many GPUs than FSDP alone. Whereas FSDP shards model parameters, gradients, and optimizer states across GPUs while each GPU processes a different slice of the data, TP splits each model layer itself across GPUs, so the computation for a single layer is distributed and scales to larger models. In addition to training, we've also enabled TP inference, which is crucial for generating text or doing reinforcement learning when your model doesn't fit on a single GPU. To learn more about how to define a tensor parallel plan for your model, take a look here.
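As a rough sketch (not the exact shipped config), a TP + FSDP run could be launched like the command below; the `llama3_3/70B_full` config name and the `tensor_parallel_dim` override are assumptions here, so check the distributed configs in this release for the exact keys and supported models.

```bash
# Illustrative only: 8 GPUs on one node, splitting each layer 2 ways with tensor
# parallelism (assumed override key) while FSDP shards across the remaining GPUs.
tune run --nnodes 1 --nproc_per_node 8 full_finetune_distributed \
    --config llama3_3/70B_full \
    tensor_parallel_dim=2
```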
Multinode training support (#2301)
Multinode finetuning is now supported, allowing you to train larger models faster. Using SLURM, you can launch `tune run` across multiple nodes and train just as you would on a single machine. We include an example SLURM script and a tutorial for getting started here.
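For orientation, a SLURM launch could look roughly like the sketch below; the resource requests, rendezvous flags (torchrun-style, assumed to pass through `tune run`), and config name are placeholders, and the example script shipped with the repo is the reference.

```bash
#!/bin/bash
# Illustrative 2-node sketch; adapt resources, paths, and config to your cluster.
#SBATCH --job-name=torchtune-multinode
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Use the first allocated node as the rendezvous host.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun tune run \
    --nnodes 2 --nproc_per_node 8 \
    --rdzv_backend c10d --rdzv_endpoint "${head_node}:29500" \
    full_finetune_distributed --config llama3_3/70B_full
```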
Full Distributed DPO recipe (#2275)
We've had DPO support for some time, but you can now train DPO using all of the distributed goodies we already offer, including those listed above. This extends our recipe coverage for the growing number of 70B+ models. To finetune Llama 3.1 8B with the full distributed DPO recipe, you can run:
# Download Llama 3.1 8B
tune download meta-llama/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth"
# Finetune on four devices
tune run --nnodes 1 --nproc_per_node 4 full_dpo_distributed --config llama3_1/8B_full_dpo
A special thanks to @sam-pi for adding this recipe.
Phi 4 models (#1835)
We now support Phi 4! This includes the 14B model for now, with recipes for full, LoRA, and QLoRA finetuning on one or more devices. For example, you can run a full finetune of Phi 4 14B on a single GPU:
# Download Phi 4 14B
tune download microsoft/phi-4
# Install bitsandbytes
pip install bitsandbytes
# Finetune on a single GPU
tune run full_finetune_single_device --config phi4/14B_full_low_memory
A huge thanks to @krammnic for landing these models!
Improved NPU support (#2234)
We are continuing to improve our support for Ascend NPU devices. This release includes fixes and enhancements to give you better performance with the NPU backend. Thank you to @Nicorgi for the help!
What's Changed
- Small readme, config updates by @ebsmothers in #2157
- Using `FormattedCheckpointFiles` in configs by @SalmanMohammadi in #2147
- Move `get_world_size_and_rank` to utils by @joecummings in #2155
- Faster intermediate checkpoints with DCP async save in TorchTune by @saumishr in #2006
- torchdata integration - multi-dataset and streaming support by @andrewkho in #1929
- Allow higher version of lm-eval by @joecummings in #2165
- Using `FormattedCheckpointFiles` in configs... round 2 by @SalmanMohammadi in #2167
- [EZ] Fix set_torch_num_threads in multi-node. by @EugenHotaj in #2164
- Fix `adapter_config.json` saving in DPO recipes by @SalmanMohammadi in #2162
- Fix excessive QAT warning by @andrewor14 in #2174
- Add output dir to top of all configs by @ebsmothers in #2183
- change saving logic by @felipemello1 in #2182
- output_dir not in ckpt dir by @felipemello1 in #2181
- Set teacher ckptr output_dir to match student in KD configs by @ebsmothers in #2185
- raise compile error by @felipemello1 in #2188
- Update DPO Max Seq Len by @pbontrager in #2176
- Llama3.2 3B eval by @ReemaAlzaid in #2186
- Update typo in docstring for _generation.get_causal_mask_from_padding… by @psoulos in #2187
- new docs for checkpointing by @felipemello1 in #2189
- Update E2E Tutorial w/ vLLM and HF Hub by @joecummings in #2192
- pytorch/torchtune/tests/torchtune/modules/_export by @gmagogsfm in #2179
- update torchtune version by @felipemello1 in #2195
- [metric_logging][wandb] Fix wandb metric logger config save path by @akashc1 in #2196
- Add evaluation file for code_llama2 model by @ReemaAlzaid in #2209
- Adds message_transform link from SFTDataset docstring to docs by @thomasjpfan in #2219
- Change alpaca_dataset train_on_input doc to match default value by @mirceamironenco in #2227
- Set default value for 'subset' parameter in the_cauldron_dataset by @Ankur-singh in #2228
- Add eval config for QWEN2_5 model using 0.5B variant by @Ankur-singh in #2230
- T5 Encoder by @calvinpelletier in #2069
- Migrate distributed state dict API by @mori360 in #2138
- Flux Autoencoder by @calvinpelletier in #2098
- Fix gradient scaling to account for world_size normalization by @mirceamironenco in #2172
- [Small fix] Update CUDA version in README by @acisseJZhong in #2242
- Adds clip_grad_norm to all recipe config that supports it by @thomasjpfan in #2220
- llama 3.1 has correct `max_seq_len` for all versions by @akashc1 in #2203
- Log grad norm aggregated over all ranks, not just rank zero by @ebsmothers in #2248
- Remove example inputs from aoti_compile_and_package by @angelayi in #2244
- Fix issue #2243, update the document to show correct usage by @insop in #2252
- [EZ] Fix config bug where interpolation happens too early by @EugenHotaj in #2236
- Small formatting fix by @krammnic in #2256
- Multi-tile support in vision rope by @RdoubleA in #2247
- Add AlpacaToMessages to message transforms doc page by @AndrewMead10 in #2265
- Add a "division by zero" check in chunked loss handling in kd_losses.py by @insop in #2239
- Fixing docstring linter by @SalmanMohammadi in #2163
- PPO Performance Improvements by @SalmanMohammadi in #2066
- Add Ascend NPU as a backend for single device recipes by @Nicorgi in #2234
- Fix tests due to upgrade to cuda126 by @acisseJZhong in #2260
- Fix a bug in set float32 precision by @Nicorgi in #2271
- Construct EarlyFusion's encoder_token_ids on correct device by @ebsmothers in #2276
- Sample packing for ConcatDataset by @ebsmothers in #2278
- Added Distributed(Tensor Parallel) Inference Recipe by @acisseJZhong in #2245
- Logging resolved config by @Ankur-singh in #2274
- Removing `SimPOLoss` by @SalmanMohammadi in #2290
- Proper prefix handling in EarlyFusion sd hooks by @ebsmothers in #2291
- Remove deprecated components for 0.6.0 by @RdoubleA in #2293
- Update the e2e flow tutorial to fix errors of generate by @iseeyuan in #2251
- profiling ops on xpu by @songhappy in #2249
- Refactored modules/tokenizers to be a subdir of modules/transforms by @Ankur-singh in #2231
- Update model builders by @Ankur-singh in #2282
- [EZ] Only log deprecation warning on rank zero by @RdoubleA in #2308
- [ez] Add output_dir field to a couple configs by @ebsmothers in #2309
- Disable DSD and fix bitsandbytes test by @RdoubleA in #2314
- fix state dict hook for early fusion models by @acisseJZhong in #2317
- Adding reverse and symmetric KLD losses by @insop in #2094
- [WIP] 'tune cat' command for pretty printing configuration files by @Ankur-singh in #2298
- Use checkout@v4 / upload@v4 for docs build by @joecummings in #2322
- Fix stop tokens in PPO by @RedTachyon in #2304
- Update PT pin for modules/_export by @Jack-Khuu in #2336
- Update to proper EOS ids for Qwen2 and Qwen2.5 by @joecummings in #2342
- Multinode support in torchtune by @joecummings in #2301
- [Bug Fix]Disable DSD for saving ckpt by @acisseJZhong in #2346
- Update README for multinode by @joecummings in #2348
- added `tie_word_embeddings` to llama3_2 models by @jingzhaoou in #2331
- Fix saving adapter weights after disabling DSD by @acisseJZhong in #2351
- Remove "ft-" prefix from checkpoint shards. by @EugenHotaj in #2354
- Full DPO Distributed by @sam-pi in #2275
- [Fix Test] Fix failed generation test by pining pytorch nightlies by @acisseJZhong in #2362
- TP + FSDP distributed training (full finetuning) by @acisseJZhong in #2330
- Add max-autotune try/except if flex attn breaks by @felipemello1 in #2357
- readme updates for full DPO distributed recipe by @ebsmothers in #2363
- Fix Qwen config by @acisseJZhong in #2377
- feat: Added cfg.cudnn_deterministic_mode flag by @bogdansalyp in #2367
- Add Phi4 by @krammnic in #2197
- Add tests and implementation for disabling dropout layers in models by @Ankur-singh in #2378
- nit: Phi4 to readme by @krammnic in #2383
- Implements MLFlowLogger by @nathan-az in #2365
- 'ft-' prefix occurrence removal by @rajuptvs in #2385
- check if log_dir is not none by @felipemello1 in #2389
- HF tokenizers: initial base tokenizer support by @ebsmothers in #2350
- Simplify README and prominently display recipes by @joecummings in #2349
- Renamed parallelize_plan to tensor_parallel_plan by @pbontrager in #2387
- Fix optimizer_in_backward at loading opt_state_dict in distributed recipes by @mori360 in #2390
- Add core dependency on stable torchdata (#2408) by @pbontrager in #2509
New Contributors
- @saumishr made their first contribution in #2006
- @andrewkho made their first contribution in #1929
- @EugenHotaj made their first contribution in #2164
- @ReemaAlzaid made their first contribution in #2186
- @psoulos made their first contribution in #2187
- @gmagogsfm made their first contribution in #2179
- @akashc1 made their first contribution in #2196
- @acisseJZhong made their first contribution in #2242
- @angelayi made their first contribution in #2244
- @insop made their first contribution in #2252
- @AndrewMead10 made their first contribution in #2265
- @Nicorgi made their first contribution in #2234
- @jingzhaoou made their first contribution in #2331
- @sam-pi made their first contribution in #2275
- @bogdansalyp made their first contribution in #2367
- @nathan-az made their first contribution in #2365
- @rajuptvs made their first contribution in #2385
Full Changelog: v0.5.0...v0.6.0