Open
Description
GRPO (Group Relative Policy Optimization, introduced by DeepSeek) is now widely used. Several implementations and variants already exist and can serve as starting points. It is also a fairly simple algorithm.
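To illustrate why the algorithm is considered simple, here is a minimal sketch of its core step: instead of a learned critic, each rollout's advantage is its reward normalized against the other rollouts in the same group. The function name `group_relative_advantages`, the `eps` parameter, and the use of the population standard deviation are assumptions made for this example, not part of any existing API.

```python
from statistics import mean, pstdev
from typing import Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> list[float]:
    """Normalize the rewards of one group of rollouts.

    All rollouts in `rewards` are assumed to start from the same
    state or prompt. `eps` (hypothetical parameter) avoids division
    by zero when every rollout in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two successful and two failed rollouts in one group.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Whether to divide by the standard deviation at all is one of the points on which published variants differ, so it is a natural candidate for a configuration option.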
The implementation should be analogous to other on-policy algorithms.
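One way to read "analogous" here: GRPO is commonly described as reusing PPO's clipped surrogate objective, with group-relative advantages in place of critic-based estimates. The sketch below shows that per-sample term under this assumption. The name `clipped_surrogate` and the default `clip_eps=0.2` are hypothetical, and the KL penalty against a reference policy that GRPO typically adds is left out for brevity.

```python
import math


def clipped_surrogate(
    logp_new: float, logp_old: float, advantage: float, clip_eps: float = 0.2
) -> float:
    """PPO-style clipped objective for one sample (to be maximized).

    `advantage` would be a group-relative advantage in GRPO.
    The KL penalty term is intentionally omitted in this sketch.
    """
    ratio = math.exp(logp_new - logp_old)  # importance ratio pi_new / pi_old
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# With an unchanged policy the ratio is 1, so the objective equals the advantage.
unchanged = clipped_surrogate(logp_new=-1.0, logp_old=-1.0, advantage=0.7)
```

If this reading holds, most of an existing PPO update step could be shared, with only the advantage computation swapped out.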
We should survey the research papers on GRPO and decide which of its variants we want to support from the start.
Performance should be tested and compared to PPO.
Once an implementation exists and has been tested with the low-level API, it should also be included in the high-level API. That can happen in a separate follow-up PR.
@arnaujc91 will take this issue