
Reward-KL Comparison #27

@vincezh2000

Description

Question about KL Divergence Evaluation in DPO Implementation

I read the paper "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint" and noticed your comparison of different methods and iterations using reward-KLD metrics.

In the implementation, I see that during training you compute the DPO preference logits (the term that carries the implicit KL constraint against the reference model) as follows:

# Compute policy log-ratios
pi_logratios = policy_chosen_logps - policy_rejected_logps

# Handle reference model computations
if reference_free:
    ref_logratios = torch.tensor([0], dtype=pi_logratios.dtype, device=pi_logratios.device)
else:
    ref_logratios = reference_chosen_logps - reference_rejected_logps

# Move tensors to device
pi_logratios = pi_logratios.to(device)
ref_logratios = ref_logratios.to(device)

# Compute final preference logits
logits = pi_logratios - ref_logratios
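
For context, my understanding is that these logits then feed into the standard DPO sigmoid loss, where beta sets the strength of the implicit KL constraint against the reference model. A minimal sketch of that step (my own paraphrase, not necessarily this repo's exact code):

import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(logits: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    # logits = (policy log-ratio) - (reference log-ratio), as computed above.
    # beta is the coefficient of the implicit KL penalty: a larger beta keeps
    # the policy closer to the reference model.
    losses = -F.logsigmoid(beta * logits)
    return losses.mean()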

I have a few questions about how KLD is evaluated after training:

  1. What is your evaluation methodology for computing KLD? Do you:
    • Sample multiple responses for each input?
    • Average KLD across these samples?
  2. Regarding the evaluation inputs:
    • Do you use different inputs for each response?
    • Is there a specific sampling strategy for the inputs?

This information would help clarify the practical aspects of implementing and evaluating the KL-divergence constraint described in the paper. To make the question concrete, I have included below a rough sketch of the kind of evaluation I have in mind.
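
A minimal sketch, assuming responses are sampled from the trained policy for a fixed set of evaluation prompts and then scored under both the policy and the reference model; sample_response and score_logprob are hypothetical helpers standing in for whatever this repo actually uses:

import torch

@torch.no_grad()
def estimate_sequence_kl(policy, ref_model, prompts, sample_response, score_logprob, n_samples=4):
    # Monte Carlo estimate of E_x E_{y ~ pi(.|x)} [ log pi(y|x) - log pi_ref(y|x) ].
    # sample_response(policy, prompt) -> one response sampled from the policy
    #     (hypothetical helper).
    # score_logprob(model, prompt, response) -> summed token log-prob of the
    #     response given the prompt, as a scalar tensor (hypothetical helper).
    log_ratios = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = sample_response(policy, prompt)
            logp_policy = score_logprob(policy, prompt, response)
            logp_ref = score_logprob(ref_model, prompt, response)
            log_ratios.append(logp_policy - logp_ref)
    # Average over all prompts and all samples per prompt.
    return torch.stack(log_ratios).mean()

Is this roughly what was done for the reward-KL comparison?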

Thanks!

