
Question about the k3 KL estimator implementation #3556

@ZhangShiyue


I was puzzled by the k3 KL estimator implementation.

It does not seem to implement the k3 estimator from John Schulman's blog. In that blog, x is a single variable, but in the context of LMs, x is a sequence. Once you take into account that x is a sequence, the current implementation in TRL seems off.
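
To make the distinction concrete, here is a minimal sketch (a toy illustration, not TRL's code) that assumes the k3 form from Schulman's blog, `(r - 1) - log r` with `r = p(x) / q(x)` and `x` sampled from `q`. The helper names and the per-token log ratios are hypothetical. It compares applying k3 once to the whole sequence against applying k3 to each token and summing. Which of the two a given TRL trainer corresponds to is not established by this sketch.

```python
import math


def k3(log_ratio: float) -> float:
    """k3 for one sample x ~ q, where log_ratio = log p(x) - log q(x)."""
    # (r - 1) - log r, using expm1 for numerical stability near r = 1
    return math.expm1(log_ratio) - log_ratio


def k3_sequence_level(token_log_ratios):
    """Treat the whole sequence as the single variable x: log r = sum_t log r_t."""
    return k3(sum(token_log_ratios))


def k3_token_level_sum(token_log_ratios):
    """Apply k3 to each token's ratio separately, then sum over positions."""
    return sum(k3(lr) for lr in token_log_ratios)


# Hypothetical values of log p(x_t | x_<t) - log q(x_t | x_<t) for one sampled sequence.
token_log_ratios = [0.2, -0.1, 0.3]

seq_value = k3_sequence_level(token_log_ratios)
tok_value = k3_token_level_sum(token_log_ratios)
print(f"sequence-level k3: {seq_value:.6f}")
print(f"token-level k3 sum: {tok_value:.6f}")
```

For the same sample, the two quantities are both non-negative but are not equal, so it matters which one an implementation is meant to compute.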

I drafted some detailed derivations in https://zhangshiyue.github.io/#/blog/KL.

I wonder if I missed anything and if my confusion makes sense.

Happy to hear feedback from anyone~


Labels: ❓ question (Seeking clarification or more information), 🏋 PPO (Related to PPO)
