v0.6.0
Highlights
We are releasing torchtune v0.6.0 with exciting new features and improved distributed training support! This release includes Tensor Parallel (TP) + FSDP training, TP inference, multinode training, and a full distributed DPO recipe. We also landed Phi 4, logging with MLflow, and improved support for NPUs.
Tensor Parallel training + inference (#2245) (#2330)
Tensor parallelism (TP) is a model parallelism technique for distributed training. When combined with FSDP, TP enables more efficient training of large models across many GPUs than FSDP alone. Whereas FSDP shards model parameters, gradients, and optimizer states across GPUs while each GPU processes a different slice of the data, TP splits each model layer itself across GPUs, so the computation for a single layer is distributed and scales to larger models. In addition to training, we've also enabled TP inference, which is crucial for generating text or doing reinforcement learning when your model doesn't fit on a single GPU. To learn more about how to define a tensor parallel plan for your model, take a look here.
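As a rough sketch (not the exact shipped config), a TP + FSDP run could be launched like the command below; the `llama3_3/70B_full` config name and the `tensor_parallel_dim` override are assumptions here, so check the distributed configs in this release for the exact keys and supported models.

```bash
# Illustrative only: 8 GPUs on one node, splitting each layer 2 ways with tensor
# parallelism (assumed override key) while FSDP shards across the remaining GPUs.
tune run --nnodes 1 --nproc_per_node 8 full_finetune_distributed \
    --config llama3_3/70B_full \
    tensor_parallel_dim=2
```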
Multinode training support (#2301)
Multinode finetuning is now supported, allowing you to train larger models faster. Using SLURM, you can launch `tune run` across multiple nodes and train just as you would on a single machine. We include an example SLURM script and a tutorial for getting started here.
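For orientation, a SLURM launch could look roughly like the sketch below; the resource requests, rendezvous flags (torchrun-style, assumed to pass through `tune run`), and config name are placeholders, and the example script shipped with the repo is the reference.

```bash
#!/bin/bash
# Illustrative 2-node sketch; adapt resources, paths, and config to your cluster.
#SBATCH --job-name=torchtune-multinode
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Use the first allocated node as the rendezvous host.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun tune run \
    --nnodes 2 --nproc_per_node 8 \
    --rdzv_backend c10d --rdzv_endpoint "${head_node}:29500" \
    full_finetune_distributed --config llama3_3/70B_full
```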
Full Distributed DPO recipe (#2275)
We've had DPO support for some time, but you can now train DPO using all of the distributed goodies we already offer, including those listed above. This extends our recipe coverage for the growing number of 70B+ models. To finetune Llama 3.1 8B with the full distributed DPO recipe, you can run:
# Download Llama 3.1 8B
tune download meta-llama/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth"
# Finetune on four devices
tune run --nnodes 1 --nproc_per_node 4 full_dpo_distributed --config llama3_1/8B_full_dpo
A special thanks to @sam-pi for adding this recipe.
Phi 4 models (#1835)
We now support Phi 4! This includes the 14B model for now, with recipes for full, LoRA, and QLoRA finetuning on one or more devices. For example, you can run a full finetune of Phi 4 14B on a single GPU:
# Download Phi 4 14B
tune download microsoft/phi-4
# Install bitsandbytes
pip install bitsandbytes
# Finetune on a single GPU
tune run full_finetune_single_device --config phi4/14B_full_low_memory
A huge thanks to @krammnic for landing these models!
Improved NPU support (#2234)
We are continuing to improve our support for Ascend NPU devices. This release includes fixes and enhancements to give you better performance with the NPU backend. Thank you to @Nicorgi for the help!
What's Changed
- Small readme, config updates by @ebsmothers in #2157
- Using `FormattedCheckpointFiles` in configs by @SalmanMohammadi in #2147
- Move `get_world_size_and_rank` to utils by @joecummings in #2155
- Faster intermediate checkpoints with DCP async save in TorchTune by @saumishr in #2006
- torchdata integration - multi-dataset and streaming support by @andrewkho in #1929
- Allow higher version of lm-eval by @joecummings in #2165
- Using `FormattedCheckpointFiles` in configs... round 2 by @SalmanMohammadi in #2167
- [EZ] Fix set_torch_num_threads in multi-node. by @EugenHotaj in #2164
- Fix `adapter_config.json` saving in DPO recipes by @SalmanMohammadi in #2162
- Fix excessive QAT warning by @andrewor14 in #2174
- Add output dir to top of all configs by @ebsmothers in #2183
- change saving logic by @felipemello1 in #2182
- output_dir not in ckpt dir by @felipemello1 in #2181
- Set teacher ckptr output_dir to match student in KD configs by @ebsmothers in #2185
- raise compile error by @felipemello1 in #2188
- Update DPO Max Seq Len by @pbontrager in #2176
- Llama3.2 3B eval by @ReemaAlzaid in #2186
- Update typo in docstring for _generation.get_causal_mask_from_padding… by @psoulos in #2187
- new docs for checkpointing by @felipemello1 in #2189
- Update E2E Tutorial w/ vLLM and HF Hub by @joecummings in #2192
- pytorch/torchtune/tests/torchtune/modules/_export by @gmagogsfm in #2179
- update torchtune version by @felipemello1 in #2195
- [metric_logging][wandb] Fix wandb metric logger config save path by @akashc1 in #2196
- Add evaluation file for code_llama2 model by @ReemaAlzaid in #2209
- Adds message_transform link from SFTDataset docstring to docs by @thomasjpfan in #2219
- Change alpaca_dataset train_on_input doc to match default value by @mirceamironenco in #2227
- Set default value for 'subset' parameter in the_cauldron_dataset by @Ankur-singh in #2228
- Add eval config for QWEN2_5 model using 0.5B variant by @Ankur-singh in #2230
- T5 Encoder by @calvinpelletier in #2069
- Migrate distributed state dict API by @mori360 in #2138
- Flux Autoencoder by @calvinpelletier in #2098
- Fix gradient scaling to account for world_size normalization by @mirceamironenco in #2172
- [Small fix] Update CUDA version in README by @acisseJZhong in #2242
- Adds clip_grad_norm to all recipe config that supports it by @thomasjpfan in #2220
- llama 3.1 has correct `max_seq_len` for all versions by @akashc1 in #2203
- Log grad norm aggregated over all ranks, not just rank zero by @ebsmothers in #2248
- Remove example inputs from aoti_compile_and_package by @angelayi in #2244
- Fix issue #2243, update the document to show correct usage by @insop in #2252
- [EZ] Fix config bug where interpolation happens too early by @EugenHotaj in #2236
- Small formatting fix by @krammnic in #2256
- Multi-tile support in vision rope by @RdoubleA in #2247
- Add AlpacaToMessages to message transforms doc page by @AndrewMead10 in #2265
- Add a "division by zero" check in chunked loss handling in kd_losses.py by @insop in #2239
- Fixing docstring linter by @SalmanMohammadi in #2163
- PPO Performance Improvements by @SalmanMohammadi in #2066
- Add Ascend NPU as a backend for single device recipes by @Nicorgi in #2234
- Fix tests due to upgrade to cuda126 by @acisseJZhong in #2260
- Fix a bug in set float32 precision by @Nicorgi in #2271
- Construct EarlyFusion's encoder_token_ids on correct device by @ebsmothers in #2276
- Sample packing for ConcatDataset by @ebsmothers in #2278
- Added Distributed(Tensor Parallel) Inference Recipe by @acisseJZhong in #2245
- Logging resolved config by @Ankur-singh in #2274
- Removing `SimPOLoss` by @SalmanMohammadi in #2290
- Proper prefix handling in EarlyFusion sd hooks by @ebsmothers in #2291
- Remove deprecated components for 0.6.0 by @RdoubleA in #2293
- Update the e2e flow tutorial to fix errors of generate by @iseeyuan in #2251
- profiling ops on xpu by @songhappy in #2249
- Refactored modules/tokenizers to be a subdir of modules/transforms by @Ankur-singh in #2231
- Update model builders by @Ankur-singh in #2282
- [EZ] Only log deprecation warning on rank zero by @RdoubleA in #2308
- [ez] Add output_dir field to a couple configs by @ebsmothers in #2309
- Disable DSD and fix bitsandbytes test by @RdoubleA in #2314
- fix state dict hook for early fusion models by @acisseJZhong in #2317
- Adding reverse and symmetric KLD losses by @insop in #2094
- [WIP] 'tune cat' command for pretty printing configuration files by @Ankur-singh in #2298
- Use checkout@v4 / upload@v4 for docs build by @joecummings in #2322
- Fix stop tokens in PPO by @RedTachyon in #2304
- Update PT pin for modules/_export by @Jack-Khuu in #2336
- Update to proper EOS ids for Qwen2 and Qwen2.5 by @joecummings in #2342
- Multinode support in torchtune by @joecummings in #2301
- [Bug Fix]Disable DSD for saving ckpt by @acisseJZhong in #2346
- Update README for multinode by @joecummings in #2348
- added `tie_word_embeddings` to llama3_2 models by @jingzhaoou in #2331
- Fix saving adapter weights after disabling DSD by @acisseJZhong in #2351
- Remove "ft-" prefix from checkpoint shards. by @EugenHotaj in #2354
- Full DPO Distributed by @sam-pi in #2275
- [Fix Test] Fix failed generation test by pining pytorch nightlies by @acisseJZhong in #2362
- TP + FSDP distributed training (full finetuning) by @acisseJZhong in #2330
- Add max-autotune try/except if flex attn breaks by @felipemello1 in #2357
- readme updates for full DPO distributed recipe by @ebsmothers in #2363
- Fix Qwen config by @acisseJZhong in #2377
- feat: Added cfg.cudnn_deterministic_mode flag by @bogdansalyp in #2367
- Add Phi4 by @krammnic in #2197
- Add tests and implementation for disabling dropout layers in models by @Ankur-singh in #2378
- nit: Phi4 to readme by @krammnic in #2383
- Implements MLFlowLogger by @nathan-az in #2365
- 'ft-' prefix occurrence removal by @rajuptvs in #2385
- check if log_dir is not none by @felipemello1 in #2389
- HF tokenizers: initial base tokenizer support by @ebsmothers in #2350
- Simplify README and prominently display recipes by @joecummings in #2349
- Renamed parallelize_plan to tensor_parallel_plan by @pbontrager in #2387
- Fix optimizer_in_backward at loading opt_state_dict in distributed recipes by @mori360 in #2390
- Add core dependency on stable torchdata (#2408) by @pbontrager in #2509
New Contributors
- @saumishr made their first contribution in #2006
- @andrewkho made their first contribution in #1929
- @EugenHotaj made their first contribution in #2164
- @ReemaAlzaid made their first contribution in #2186
- @psoulos made their first contribution in #2187
- @gmagogsfm made their first contribution in #2179
- @akashc1 made their first contribution in #2196
- @acisseJZhong made their first contribution in #2242
- @angelayi made their first contribution in #2244
- @insop made their first contribution in #2252
- @AndrewMead10 made their first contribution in #2265
- @Nicorgi made their first contribution in #2234
- @jingzhaoou made their first contribution in #2331
- @sam-pi made their first contribution in #2275
- @bogdansalyp made their first contribution in #2367
- @nathan-az made their first contribution in #2365
- @rajuptvs made their first contribution in #2385
Full Changelog: v0.5.0...v0.6.0