Open
Labels: ❓ question (seeking clarification or more information), 🏋 PPO (related to PPO)
Description
I am puzzled by the k3 KL estimator implementation.
It does not seem to implement the k3 estimator from John Schulman's blog. In that blog, x is a single variable, but in the LM context x is a sequence of tokens. Once you account for x being a sequence, the current implementation in TRL seems off.
I drafted some detailed derivations at https://zhangshiyue.github.io/#/blog/KL.
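To make the concern concrete, here is a minimal sketch in plain Python. It is not TRL code; the helper names and the example numbers are hypothetical, and it assumes the k3 form from the blog, k3 = (r - 1) - log r with r = p(x)/q(x) and x sampled from q. It compares two readings: applying k3 once to the whole sequence (where the sequence log-ratio is the sum of per-token log-ratios) versus applying k3 to each token and summing.

```python
import math


def k3(log_ratio: float) -> float:
    """k3 estimate for one sample: (r - 1) - log r, where log_ratio = log p(x) - log q(x)."""
    # expm1(v) computes exp(v) - 1 accurately for small v.
    return math.expm1(log_ratio) - log_ratio


def sequence_level_k3(token_log_ratios: list[float]) -> float:
    """Treat the whole sequence as the single sample x.

    For an autoregressive model, log r(x) is the sum of per-token log-ratios.
    """
    return k3(sum(token_log_ratios))


def token_level_k3_sum(token_log_ratios: list[float]) -> float:
    """Apply k3 to each token's log-ratio separately, then sum over tokens."""
    return sum(k3(lr) for lr in token_log_ratios)


# Hypothetical per-token log-ratios for one sampled sequence (illustrative only).
token_log_ratios = [0.2, -0.1, 0.3]

seq_value = sequence_level_k3(token_log_ratios)
tok_value = token_level_k3_sum(token_log_ratios)
print(f"sequence-level k3: {seq_value:.6f}")
print(f"sum of token-level k3: {tok_value:.6f}")
```

The two values agree for a one-token sequence but generally differ for longer ones, because k3 is nonlinear in the log-ratio. Whether the per-token sum is still a valid estimator of the sequence-level KL is exactly the question here.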
I wonder if I missed anything and if my confusion makes sense.
Happy to hear feedback from anyone~