The key difference between sync and async modes is how data collection and optimization are handled:

**Synchronous Mode (grpo-sync.py)**:

```python
# Three nested loops:
for data in collector:  # Data collection loop
    for epoch in range(epochs):  # Epoch loop
        for batch in replay_buffer:  # Buffer consumption loop
            # Optimize on batch
            loss = loss_fn(batch)
            loss.backward()
            optimizer.step()
    # Weight update
    weight_updater.push_weights(policy_training)
```

**Asynchronous Mode (grpo-async.py)**:
```python
# Start data collection in the background
collector.start()

# Single optimization loop
for step in range(total_steps):
    # Sample and optimize
    batch = replay_buffer.sample()
    loss = loss_fn(batch)
    loss.backward()
    optimizer.step()

    # Update weights periodically
    if cond():
        weight_updater.push_weights(policy_training)
```

Key differences:
1. **Data Collection**:
   - Sync: Data collection and optimization happen sequentially
   - Async: Data collection runs in the background while optimization happens

2. **Buffer Size**:
   - Sync: The buffer size must equal the batch size returned by the collector (`buffer_size = steps_per_batch`); see the sketch after this list
   - Async: The buffer can be larger than the batch size, allowing for more diverse sampling

3. **Data Processing**:
   - Sync: Processes the same data multiple times (epochs)
   - Async: Each piece of data is processed a non-deterministic number of times

4. **Weight Updates**:
   - Sync: Weights are updated before every collection of data
   - Async: Weights are updated at a given interval (in gradient steps)

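Regarding point 2, here is a minimal sketch of the buffer-size relationship, assuming TorchRL's generic `ReplayBuffer`/`LazyTensorStorage` API; the values are illustrative only, and the actual scripts build their buffers from the config (possibly with a different storage class):

```python
from torchrl.data import LazyTensorStorage, ReplayBuffer

# Illustrative numbers only; the real scripts size these from the config.
steps_per_batch = 32   # transitions returned by the collector per iteration
optim_batch_size = 8   # mini-batch size used for each gradient step

# Sync: the buffer holds exactly one collector batch (buffer_size = steps_per_batch),
# which is consumed over several epochs before the next collection.
sync_rb = ReplayBuffer(
    storage=LazyTensorStorage(steps_per_batch),
    batch_size=optim_batch_size,
)

# Async: the buffer can retain several collector batches, so optimization samples
# from a larger, more diverse pool while collection keeps running in the background.
async_rb = ReplayBuffer(
    storage=LazyTensorStorage(4 * steps_per_batch),
    batch_size=optim_batch_size,
)
```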
The async mode offers better performance by:
- Running data collection and optimization concurrently
- Using the GPU more efficiently
- Reducing memory overhead
- Increasing throughput
- Allowing more flexible buffer management

### KL Divergences in PPO: Reference vs Inference
KL divergence is a key regularization term in policy optimization algorithms like PPO and in LLM post-training. It measures how much the updated policy diverges from a baseline or reference policy, helping to prevent the new policy from drifting too far and ensuring stable learning.
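Formally, for the token-level distributions $P$ and $Q$ of two policies, this divergence is

```math
\mathrm{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\bigl[\log P(x) - \log Q(x)\bigr]
```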
There are two main types of KL divergences commonly used:

#### 1. KL to Reference Policy (KL[ref || policy])
- **Definition:** Measures how much the new (learned) policy diverges from a fixed reference policy (often the original, pre-trained model).
- **Implementation:** In GRPO, this is computed as `(ref_log_prob - cur_log_prob).expm1() - (ref_log_prob - cur_log_prob)`, which is a numerically stable way to compute KL for log probabilities (see the sketch after this list).
- **Usage:**
  - **LLM Post-Training:** This is the canonical choice in LLM post-training (e.g., RLHF, DPO, GRPO). The reference is usually the original language model before any RL fine-tuning. Penalizing KL[ref || policy] ensures the fine-tuned model stays close to the original, preserving language quality and preventing over-optimization.
- **Effect:** Encourages the new policy to not deviate too much from the reference, maintaining fluency and generalization.

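A minimal sketch of this estimator with plain PyTorch tensors (illustrative shapes and values, not the actual training code):

```python
import torch

# Per-token log-probabilities of the same sampled tokens under the reference
# model and under the current (trainable) policy (illustrative values/shapes).
ref_log_prob = -torch.rand(4, 16)  # [batch, seq_len]
cur_log_prob = -torch.rand(4, 16)

# KL-to-reference penalty per token, in the numerically stable form
# exp(x) - 1 - x with x = ref_log_prob - cur_log_prob (non-negative for all x).
diff = ref_log_prob - cur_log_prob
kl_to_ref = diff.expm1() - diff

# Reduced to a scalar before being weighted into the loss or the reward.
kl_to_ref_penalty = kl_to_ref.mean()
```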
#### 2. KL to Inference Policy (KL[policy || inference])
- **Definition:** Measures how much the current policy diverges from the policy used to generate the data (the inference policy, sometimes called the behavior policy).
- **Implementation:** In GRPO, this is approximated as `prev_log_prob - cur_log_prob`, where `prev_log_prob` is from the inference policy that generated the data (see the sketch after this list).
- **Usage:**
  - **Canonical PPO:** In standard PPO (especially in RL for control), this is the canonical KL: KL[policy || inference]. The inference policy is the one that generated the trajectories in the replay buffer. Penalizing this KL ensures that the updated policy does not move too far from the data distribution, stabilizing importance sampling and learning.
- **Effect:** Prevents the policy from making large, unstable updates relative to the data it was trained on.

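A matching sketch for this approximation (same illustrative setup as above):

```python
import torch

# Log-probabilities of the sampled tokens under the inference (behavior) policy
# that generated them, as stored in the buffer, and under the current policy.
prev_log_prob = -torch.rand(4, 16)  # [batch, seq_len]
cur_log_prob = -torch.rand(4, 16)

# Per-token KL-to-inference term as described above, reduced to a scalar.
kl_to_inference = prev_log_prob - cur_log_prob
kl_to_inference_penalty = kl_to_inference.mean()
```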
| Setting | Canonical KL | Purpose |
|---|---|---|
| PPO (RL control) | KL[policy \|\| inference] | Stabilize updates, match data distribution |
| LLM Post-Training | KL[ref \|\| policy] | Stay close to pre-trained model |

In GRPO, both types of KL can be used and controlled via configuration. Typically, for LLM post-training, the KL to reference is the most important for preserving model quality, while the KL to inference is more about stabilizing the optimization process.

The KL contributions to the loss can be controlled via `train.kl_to_ref_coeff` and `train.kl_to_inference_coeff`, respectively.

Additionally, the KL-to-reference contribution can either be added to the reward during the grading of the LLM response, or added directly to the loss, as controlled by the `train.kl_coef_in_loss` config option.
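The sketch below shows how these options could fit together; the variable names are hypothetical stand-ins, not the scripts' actual code:

```python
import torch

# Hypothetical stand-ins for quantities computed elsewhere: a scalar policy
# objective, the per-token KL terms sketched earlier, and a per-sample task reward.
policy_loss = torch.tensor(0.5)
kl_to_ref = torch.rand(4, 16)
kl_to_inference = torch.randn(4, 16)
task_reward = torch.rand(4)

kl_to_ref_coeff = 1e-2        # train.kl_to_ref_coeff
kl_to_inference_coeff = 1e-2  # train.kl_to_inference_coeff
kl_coef_in_loss = True        # train.kl_coef_in_loss

if kl_coef_in_loss:
    # Canonical GRPO setting: KL[ref || policy] enters the loss directly and
    # acts as an explicit regularizer at every gradient step.
    loss = (
        policy_loss
        + kl_to_ref_coeff * kl_to_ref.mean()
        + kl_to_inference_coeff * kl_to_inference.mean()
    )
else:
    # Otherwise the KL-to-ref penalty is folded into the reward when the
    # response is graded, and only the KL-to-inference term stays in the loss.
    reward = task_reward - kl_to_ref_coeff * kl_to_ref.mean(dim=-1)
    loss = policy_loss + kl_to_inference_coeff * kl_to_inference.mean()
```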
In the original GRPO paper, the KL to reference (KL[ref || policy]) is added **directly to the loss function**, not to the reward. This means that the KL penalty acts as a regularizer during optimization, discouraging the policy from drifting too far from the reference model at every update step. This is in contrast to some RLHF-style approaches, where the KL penalty is added to the reward signal during data collection (i.e., the environment's reward is modified).

**Why does this matter?**
- **KL in the loss (as in GRPO):** The optimization explicitly balances the policy objective and the KL penalty at each gradient step, making the trade-off more direct and stable. This is the canonical approach in GRPO and is controlled by setting `train.kl_coef_in_loss=True` in the config.
- **KL in the reward:** The KL penalty is treated as part of the environment's reward, so the policy is trained to maximize this modified reward. This can sometimes make the effect of the KL less direct, as it is mixed with the task reward during data collection.

In summary, GRPO's approach of adding the KL to reference directly to the loss provides more explicit and stable regularization, and is the recommended setting for most LLM post-training scenarios.