diff --git a/docs/source/index.rst b/docs/source/index.rst
index 318c82b3e2..d62ad77b63 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -113,6 +113,7 @@ torchtune tutorials.
    recipes/recipes_overview
    recipes/lora_finetune_single_device
    recipes/qat_distributed
+   recipes/dpo
 
 .. toctree::
    :glob:
diff --git a/docs/source/recipes/dpo.rst b/docs/source/recipes/dpo.rst
new file mode 100644
index 0000000000..5fdb455a35
--- /dev/null
+++ b/docs/source/recipes/dpo.rst
@@ -0,0 +1,75 @@
+.. _dpo_recipe_label:
+
+====================================
+Direct Preference Optimization
+====================================
+
+This recipe supports several `Direct Preference Optimization <https://arxiv.org/abs/2305.18290>`_ (DPO)-style fine-tuning techniques.
+These techniques aim to steer (or `align `_) a model towards some desirable behaviours.
+For example, a common goal is to train language models to produce safe and honest outputs,
+or to be `helpful and harmless <https://arxiv.org/abs/2204.05862>`_.
+
+For the best results when using this recipe, it may be helpful to first fine-tune your model using supervised fine-tuning (SFT) to ensure it is
+on-distribution for the domain you're interested in. To do this, check out our other fine-tuning recipes in the :ref:`recipe overview ` which
+support a variety of SFT paradigms.
+
+After supervised fine-tuning, here is an example of DPO with Llama 3.1 8B:
+
+.. note::
+
+    You may need to be granted access to the Llama model you're interested in. See
+    :ref:`here ` for details on accessing gated repositories.
+
+.. code-block:: bash
+
+    tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
+        --ignore-patterns "original/consolidated.00.pth" \
+        --hf-token <HF_TOKEN>
+
+    # run on a single device
+    tune run lora_dpo_single_device --config llama3_1/8B_lora_dpo_single_device
+
+    # run on two GPUs
+    tune run --nproc_per_node 2 lora_dpo_distributed --config llama3_1/8B_lora_dpo
+
+It's easy to get started with this recipe using your dataset of choice, including custom local datasets
+and datasets from Hugging Face. Check out our primer on :ref:`preference datasets ` to
+see how to do this.
+
+For this recipe we include different DPO-style losses:
+
+* :class:`Direct Preference Optimization ` (DPO) loss [#]_. The DPO loss function
+  increases the relative log-probabilities of preferred responses over un-preferred ones, whilst using log probabilities
+  from a reference model to prevent policy degradation during training. Alongside RLHF, this is the most commonly used
+  alignment technique and is used to train a growing number of state-of-the-art LLMs, e.g. Llama 3.1, Gemma 2, and Qwen2.
+  This is a good starting point for alignment fine-tuning; a sketch of the objective is shown further below.
+* :class:`Statistical Rejection Sampling Optimization ` (RSO) or "hinge" loss [#]_.
+  RSO builds on concepts from support vector machines and DPO, applying a margin-based approach that penalizes
+  low-quality responses while ensuring a significant gap between chosen and un-chosen log probabilities.
+
+To use any of these, simply set the ``loss`` entry in your config, or override it through the :ref:`cli_label`:
+
+.. code-block:: bash
+
+    tune run lora_dpo_single_device --config llama2/7B_lora_dpo_single_device \
+    loss=torchtune.modules.loss.RSOLoss \
+    gamma=0.5
+
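+As a rough sketch of what the DPO loss above is optimizing (following the notation of the original DPO paper
+referenced below, rather than torchtune's internal variable names): given a prompt :math:`x`, a chosen response
+:math:`y_w`, and a rejected response :math:`y_l`, the policy :math:`\pi_\theta` is trained against a frozen
+reference model :math:`\pi_{\text{ref}}` to minimize
+
+.. math::
+
+    \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) =
+    -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left(
+    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
+    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
+
+where :math:`\sigma` is the sigmoid function and :math:`\beta` controls how strongly the policy is kept close to the
+reference model. Roughly speaking, the hinge-style RSO loss swaps the log-sigmoid term for a margin-based penalty on
+the same log-ratio difference.
+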
+.. todo (@SalmanMohammadi) point to an example repo for SimPO
+
+For a deeper understanding of the different levers you can pull when using this recipe,
+see our documentation for the different PEFT training paradigms we support:
+
+* :ref:`glossary_lora`
+* :ref:`glossary_qlora`
+* :ref:`glossary_dora`
+
+Many of our other memory optimization features can also be used in this recipe; you can learn more about
+them in our :ref:`memory optimization overview`.
+
+.. rubric:: References:
+
+.. [#] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S. and Finn, C., 2024.
+   Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
+.. [#] Liu, T., Zhao, Y., Joshi, R., Khalman, M., Saleh, M., Liu, P.J. and Liu, J., 2023.
+   Statistical rejection sampling improves preference optimization. arXiv preprint arXiv:2309.06657.
diff --git a/docs/source/recipes/lora_finetune_single_device.rst b/docs/source/recipes/lora_finetune_single_device.rst
index 4b4d476058..ffcca11d53 100644
--- a/docs/source/recipes/lora_finetune_single_device.rst
+++ b/docs/source/recipes/lora_finetune_single_device.rst
@@ -8,7 +8,7 @@ This recipe supports finetuning on next-token prediction tasks using parameter e
 such as :ref:`glossary_lora` and :ref:`glossary_qlora`. These techniques significantly reduce memory consumption during training whilst still
 maintaining competitive performance.
 
-We provide configs which you can get up and running quickly. Here is an example with llama 3.1 8B:
+We provide configs which you can use to get up and running quickly. Here is an example with Llama 3.1 8B:
 
 .. note::
 
diff --git a/docs/source/recipes/recipes_overview.rst b/docs/source/recipes/recipes_overview.rst
index a1c4f39ef3..e6e8c9cd63 100644
--- a/docs/source/recipes/recipes_overview.rst
+++ b/docs/source/recipes/recipes_overview.rst
@@ -28,7 +28,7 @@ Our recipes include:
 
 * Single-device full fine-tuning
 * Distributed full fine-tuning
 * Distributed LoRA fine-tuning
-* Direct Preference Optimization (DPO)
+* :ref:`Direct Preference Optimization (DPO) <dpo_recipe_label>`
 * Proximal Policy Optimization (PPO)
 * :ref:`Distributed Quantization-Aware Training (QAT)`.