
Reward-KL Comparison #27

@vincezh2000

Description

Question about KL Divergence Evaluation in DPO Implementation

I read the paper "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint" and noticed your comparison of different methods and iterations using reward-KLD metrics.

In the implementation, I see that during training you compute the DPO preference logits (the term that carries the implicit KL constraint against the reference model) as follows:

# Compute policy log-ratios
pi_logratios = policy_chosen_logps - policy_rejected_logps

# Handle reference model computations
if reference_free:
    ref_logratios = torch.tensor([0], dtype=pi_logratios.dtype, device=pi_logratios.device)
else:
    ref_logratios = reference_chosen_logps - reference_rejected_logps

# Move tensors to device
pi_logratios = pi_logratios.to(device)
ref_logratios = ref_logratios.to(device)

# Compute final preference logits
logits = pi_logratios - ref_logratios
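
For context, my understanding is that these logits then feed into the standard DPO sigmoid loss, where beta sets the strength of the implicit KL constraint against the reference model. A minimal sketch of that step (my own paraphrase, not necessarily this repo's exact code):

import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(logits: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    # logits = (policy log-ratio) - (reference log-ratio), as computed above.
    # beta is the coefficient of the implicit KL penalty: a larger beta keeps
    # the policy closer to the reference model.
    losses = -F.logsigmoid(beta * logits)
    return losses.mean()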

I have a few questions about how KLD is evaluated after training:

  1. What is your evaluation methodology for computing KLD? Do you:
    • Sample multiple responses for each input?
    • Average KLD across these samples?
  2. Regarding the evaluation inputs:
    • Do you use different inputs for each response?
    • Is there a specific sampling strategy for the inputs?

This information would help clarify the practical aspects of implementing and evaluating the KL-divergence constraint described in the paper. To make the question concrete, I have included below a rough sketch of the kind of evaluation I have in mind.
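
A minimal sketch, assuming responses are sampled from the trained policy for a fixed set of evaluation prompts and then scored under both the policy and the reference model; sample_response and score_logprob are hypothetical helpers standing in for whatever this repo actually uses:

import torch

@torch.no_grad()
def estimate_sequence_kl(policy, ref_model, prompts, sample_response, score_logprob, n_samples=4):
    # Monte Carlo estimate of E_x E_{y ~ pi(.|x)} [ log pi(y|x) - log pi_ref(y|x) ].
    # sample_response(policy, prompt) -> one response sampled from the policy
    #     (hypothetical helper).
    # score_logprob(model, prompt, response) -> summed token log-prob of the
    #     response given the prompt, as a scalar tensor (hypothetical helper).
    log_ratios = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = sample_response(policy, prompt)
            logp_policy = score_logprob(policy, prompt, response)
            logp_ref = score_logprob(ref_model, prompt, response)
            log_ratios.append(logp_policy - logp_ref)
    # Average over all prompts and all samples per prompt.
    return torch.stack(log_ratios).mean()

Is this roughly what was done for the reward-KL comparison?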

Thanks!

