Direct-Preference-Optimization-DPO-in-Pytorch

This repository provides a clean, reproducible implementation of Direct Preference Optimization (DPO), a simple, stable, and effective framework for aligning language models with human preferences without the complexities of RLHF.


What is DPO?

Direct Preference Optimization is a method introduced by Rafailov et al. (NeurIPS 2023) that streamlines preference-based fine-tuning of language models. Unlike traditional RLHF, DPO:

  • Removes the need to train a separate reward model.
  • Avoids reinforcement learning loops and hyperparameter tuning.
  • Uses a simple binary cross-entropy-style loss on preferred vs. dispreferred model outputs (the objective is written out just after this list).
  • Matches or outperforms RLHF in tasks like sentiment control, summarization, and dialogue.
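
Concretely, the DPO objective from the paper is a logistic loss on the beta-scaled log-probability ratios of the chosen response y_w and the rejected response y_l, measured against a frozen reference model:

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
        -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
        \left[ \log \sigma\!\left(
            \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
          - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
        \right) \right]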

Why DPO?

  • Simplicity: a direct classification-style loss, with no reward model and no RL loop
  • Performance: matches or beats PPO-based RLHF on key benchmarks
  • Efficiency: lightweight training with fewer computations and no sampling loops

DPO Loss Module

In src/dpo_loss.py, the heart of DPO is implemented:

    import torch.nn.functional as F

    # Average DPO loss over the batch: the *_relative_logprob terms are the
    # policy-minus-reference log-probabilities of the chosen and rejected responses.
    loss = -F.logsigmoid(
        beta * (prefered_relative_logprob - disprefered_relative_logprob)
    ).mean(dim=-1)
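
A minimal sketch of the full loss computation around that line, assuming per-sequence log-probabilities have already been gathered for the policy and a frozen reference model; the function name dpo_loss and its arguments are illustrative, not the repository's exact API:

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logprob, policy_rejected_logprob,
                 ref_chosen_logprob, ref_rejected_logprob, beta=0.1):
        # How much more likely the policy finds each response than the frozen reference.
        prefered_relative_logprob = policy_chosen_logprob - ref_chosen_logprob
        disprefered_relative_logprob = policy_rejected_logprob - ref_rejected_logprob

        # DPO loss: negative log-sigmoid of the beta-scaled margin, averaged over the batch.
        loss = -F.logsigmoid(
            beta * (prefered_relative_logprob - disprefered_relative_logprob)
        ).mean()

        # Implicit reward margin and accuracy, useful as training diagnostics.
        reward_margin = beta * (prefered_relative_logprob - disprefered_relative_logprob)
        reward_accuracy = (reward_margin > 0).float().mean()
        return loss, reward_margin.mean(), reward_accuracy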

Training Script

To launch training, run:

    bash run_training.sh

Empirical Results

Due to the constraints of a single RTX 3060 GPU, I train SmolLM-135M-Instruct with a batch size of 2 on the jondurbin/truthy-dpo-v0.1 dataset:

python train.py \
    --epochs 5 \
    --batch_size 2 \
    --max_length 256 \
    --lr 1e-6 \
    --beta 0.1 \
    --seed 2003 \
    --model_name "HuggingFaceTB/SmolLM-135M-Instruct" \
    --dataset_name "jondurbin/truthy-dpo-v0.1" \
    --wandb_project "dpo"
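
For context, one optimization step with these settings would look roughly like the sketch below, reusing the dpo_loss sketch above. The get_sequence_logprob helper and the batch field names are assumptions made for illustration; they are not necessarily the repository's actual code.

    import torch

    def get_sequence_logprob(model, input_ids, attention_mask, response_mask):
        # Hypothetical helper: sum of token log-probabilities over the response
        # tokens of each sequence (prompt tokens are masked out via response_mask).
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits[:, :-1, :]
        logprobs = torch.log_softmax(logits, dim=-1)
        token_logprobs = torch.gather(
            logprobs, 2, input_ids[:, 1:].unsqueeze(-1)
        ).squeeze(-1)
        return (token_logprobs * response_mask[:, 1:]).sum(dim=-1)

    def training_step(policy_model, ref_model, optimizer, batch, beta=0.1):
        # Score chosen and rejected responses under the trainable policy.
        policy_chosen = get_sequence_logprob(
            policy_model, batch["chosen_ids"], batch["chosen_mask"], batch["chosen_resp_mask"])
        policy_rejected = get_sequence_logprob(
            policy_model, batch["rejected_ids"], batch["rejected_mask"], batch["rejected_resp_mask"])

        # The frozen reference model is only used for scoring, so no gradients are needed.
        with torch.no_grad():
            ref_chosen = get_sequence_logprob(
                ref_model, batch["chosen_ids"], batch["chosen_mask"], batch["chosen_resp_mask"])
            ref_rejected = get_sequence_logprob(
                ref_model, batch["rejected_ids"], batch["rejected_mask"], batch["rejected_resp_mask"])

        loss, reward_margin, reward_accuracy = dpo_loss(
            policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item(), reward_margin.item(), reward_accuracy.item()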

Training is stable with the naive settings above: the loss, reward accuracy, and reward margin all converge (training curves are shown in a screenshot in the repository).


References

Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," NeurIPS 2023. https://arxiv.org/abs/2305.18290

https://github.com/mrunalmania/Direct-Preference-Optimization/tree/main

https://github.com/0xallam/Direct-Preference-Optimization/tree/main
