Question about KL Divergence Evaluation in DPO Implementation
I read the paper "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint" and noticed that you compare different methods and training iterations using reward-versus-KL-divergence (reward-KLD) metrics.
In the implementation, I see that during training the DPO loss is built from the policy and reference log-ratios as follows:
```python
# Compute policy log-ratios
pi_logratios = policy_chosen_logps - policy_rejected_logps

# Handle reference model computations
if reference_free:
    ref_logratios = torch.tensor([0], dtype=pi_logratios.dtype, device=pi_logratios.device)
else:
    ref_logratios = reference_chosen_logps - reference_rejected_logps

# Move tensors to device
pi_logratios = pi_logratios.to(device)
ref_logratios = ref_logratios.to(device)

# Compute final preference logits
logits = pi_logratios - ref_logratios
```
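For context on how I read this (assuming the rest of the loss follows the standard DPO formulation; this is my own sketch, not your exact code), these logits would then enter a logistic loss where beta acts as the implicit KL-constraint coefficient, roughly:

```python
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(logits: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    # logits = (policy log-ratio) - (reference log-ratio) from the snippet above;
    # beta scales the implicit KL penalty that keeps the policy near the reference model.
    return -F.logsigmoid(beta * logits).mean()
```

So the KL constraint only enters implicitly here, which is why I am wondering how the KL term itself is measured after training.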
I have a few questions about how KLD is evaluated after training:
- What is your evaluation methodology for computing the KLD (a sketch of what I have in mind follows this list)? Do you:
  - Sample multiple responses for each input?
  - Average the KLD across these samples?
- Regarding the evaluation inputs:
  - Do you use different inputs for each response?
  - Is there a specific sampling strategy for the inputs?
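For concreteness, here is roughly what I have in mind. This is my own sketch, not code from this repo; it assumes Hugging Face-style causal LMs (`policy`, `ref_model`), a tokenizer, and hypothetical helper names (`sequence_logprob`, `estimate_kl`). It samples several responses per prompt from the policy and averages the sequence-level log-ratio against the reference model:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_logprob(model, input_ids, prompt_len):
    # Sum of per-token log-probs of the response tokens (positions after the prompt).
    logits = model(input_ids).logits[:, :-1, :]   # logits predicting tokens 1..T-1
    targets = input_ids[:, 1:]
    logps = F.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum(dim=-1)

@torch.no_grad()
def estimate_kl(policy, ref_model, tokenizer, prompts, num_samples=4, max_new_tokens=256):
    # Monte Carlo estimate of E_x E_{y ~ pi(.|x)} [log pi(y|x) - log pi_ref(y|x)],
    # averaged over prompts and over num_samples sampled responses per prompt.
    kls = []
    for prompt in prompts:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        for _ in range(num_samples):
            # Sample a response from the *policy*, since the KL is taken under pi.
            full_ids = policy.generate(prompt_ids, do_sample=True, max_new_tokens=max_new_tokens)
            logp_pi = sequence_logprob(policy, full_ids, prompt_ids.shape[1])
            logp_ref = sequence_logprob(ref_model, full_ids, prompt_ids.shape[1])
            kls.append((logp_pi - logp_ref).item())
    return sum(kls) / len(kls)
```

Is something along these lines what you do, or does your estimator differ?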
This information would help clarify the practical aspects of implementing and evaluating the KL divergence constraints described in the paper.
Thanks!