Sensei, this PR contains a few minor potential fixes:
- Removed the unused `label_smoothing` argument from the docstring of `def compute_dpo_loss`. I believe you initially experimented with conservative DPO, and this was a remnant left in the docstring. Alternatively, we could add it back, initialized as `label_smoothing=0` for the original DPO behavior, and then change

  ```python
  losses = -F.logsigmoid(beta * logits)
  ```

  to

  ```python
  losses = -F.logsigmoid(beta * logits) * (1 - label_smoothing) - F.logsigmoid(-beta * logits) * label_smoothing
  ```

  Reference: Eq. 3 in https://ericmitchell.ai/cdpo.pdf
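  For context, a minimal sketch of what the smoothed loss could look like in full; the signature and argument names here are assumptions for illustration, not the PR's actual implementation:

  ```python
  import torch
  import torch.nn.functional as F

  def compute_dpo_loss(model_chosen_logprobs, model_rejected_logprobs,
                       reference_chosen_logprobs, reference_rejected_logprobs,
                       beta=0.1, label_smoothing=0.0):
      """DPO loss; label_smoothing > 0 gives conservative DPO (cDPO).

      Argument names are hypothetical, based on the discussion above.
      """
      model_logratios = model_chosen_logprobs - model_rejected_logprobs
      reference_logratios = reference_chosen_logprobs - reference_rejected_logprobs
      logits = model_logratios - reference_logratios

      # Eq. 3 in https://ericmitchell.ai/cdpo.pdf;
      # label_smoothing=0 recovers the original DPO loss.
      losses = (-F.logsigmoid(beta * logits) * (1 - label_smoothing)
                - F.logsigmoid(-beta * logits) * label_smoothing)
      return losses.mean()
  ```

  With `label_smoothing=0` the second term vanishes, so the change is backward compatible.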
- Potential typo in `def compute_logprobs()`, depending on the intent: the resulting shape of `avg_log_prob` would be `(batch_size,)`, not `(batch_size, num_tokens)`, regarding the comment "This averages over the tokens, so the shape is", i.e. the average log probability per sequence. If you meant the initial shape before averaging, please ignore this.
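  A quick toy check of the shape, assuming the usual mean over the token dimension:

  ```python
  import torch

  batch_size, num_tokens = 2, 5
  log_probs = torch.randn(batch_size, num_tokens)  # per-token log probabilities

  # Averaging over the token dimension collapses it,
  # leaving one average log probability per sequence.
  avg_log_prob = log_probs.mean(dim=-1)
  print(avg_log_prob.shape)  # torch.Size([2]), i.e. (batch_size,)
  ```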
- Simplified `def train_model_dpo_simple()`'s loop to a plain `for` loop. `enumerate()` was used, but the indices were never used.
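  The simplification is just the usual pattern; the loop body and data here are hypothetical stand-ins, since the actual training loop isn't shown in this comment:

  ```python
  def process(batch):
      # hypothetical stand-in for the real training step
      return batch

  data_loader = [("batch_0",), ("batch_1",)]  # stand-in for the real DataLoader

  # Before: enumerate() produced an index that was never used
  results_before = []
  for idx, batch in enumerate(data_loader):
      results_before.append(process(batch))

  # After: a plain for loop, identical behavior
  results_after = []
  for batch in data_loader:
      results_after.append(process(batch))
  ```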