The key difference between sync and async modes is how data collection and optimization are handled:

**Synchronous Mode (grpo-sync.py)**:

```python
# Three nested loops:
for data in collector:  # Data collection loop
    for epoch in range(epochs):  # Epoch loop
        for batch in replay_buffer:  # Buffer consumption loop
            # Optimize on batch
            loss = loss_fn(batch)
            loss.backward()
            optimizer.step()
    # Weight update
    weight_updater.push_weights(policy_training)
```

**Asynchronous Mode (grpo-async.py)**:
```python
# Start data collection in the background
collector.start()

# Single optimization loop
for step in range(total_steps):
    # Sample and optimize
    batch = replay_buffer.sample()
    loss = loss_fn(batch)
    loss.backward()
    optimizer.step()

    # Update weights periodically
    if cond():
        weight_updater.push_weights(policy_training)
```

Key differences:
1. **Data Collection**:
   - Sync: Data collection and optimization happen sequentially
   - Async: Data collection runs in the background while optimization happens

2. **Buffer Size**:
   - Sync: The buffer size must equal the batch size returned by the collector (`buffer_size = steps_per_batch`); see the sketch after this list
   - Async: The buffer can be larger than the batch size, allowing for more diverse sampling

3. **Data Processing**:
   - Sync: Processes the same data multiple times (epochs)
   - Async: Each piece of data is processed a non-deterministic number of times

4. **Weight Updates**:
   - Sync: Weights are updated before every collection of data
   - Async: Weights are updated at a given interval (in gradient steps)

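Regarding point 2, here is a minimal sketch of the buffer-size relationship, assuming TorchRL's generic `ReplayBuffer`/`LazyTensorStorage` API; the values are illustrative only, and the actual scripts build their buffers from the config (possibly with a different storage class):

```python
from torchrl.data import LazyTensorStorage, ReplayBuffer

# Illustrative numbers only; the real scripts size these from the config.
steps_per_batch = 32   # transitions returned by the collector per iteration
optim_batch_size = 8   # mini-batch size used for each gradient step

# Sync: the buffer holds exactly one collector batch (buffer_size = steps_per_batch),
# which is consumed over several epochs before the next collection.
sync_rb = ReplayBuffer(
    storage=LazyTensorStorage(steps_per_batch),
    batch_size=optim_batch_size,
)

# Async: the buffer can retain several collector batches, so optimization samples
# from a larger, more diverse pool while collection keeps running in the background.
async_rb = ReplayBuffer(
    storage=LazyTensorStorage(4 * steps_per_batch),
    batch_size=optim_batch_size,
)
```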
The async mode offers better performance by:
- Running data collection and optimization concurrently
- Using the GPU more efficiently
- Reducing memory overhead
- Increasing throughput
- Allowing more flexible buffer management

### KL Divergences in PPO: Reference vs Inference
KL divergence is a key regularization term in policy optimization algorithms like PPO and in LLM post-training. It measures how much the updated policy diverges from a baseline or reference policy, helping to prevent the new policy from drifting too far and ensuring stable learning.
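Formally, for the token-level distributions $P$ and $Q$ of two policies, this divergence is

```math
\mathrm{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\bigl[\log P(x) - \log Q(x)\bigr]
```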
There are two main types of KL divergences commonly used:

#### 1. KL to Reference Policy (KL[ref || policy])
- **Definition:** Measures how much the new (learned) policy diverges from a fixed reference policy (often the original, pre-trained model).
- **Implementation:** In GRPO, this is computed as `(ref_log_prob - cur_log_prob).expm1() - (ref_log_prob - cur_log_prob)`, which is a numerically stable way to compute KL for log probabilities (see the sketch after this list).
- **Usage:**
  - **LLM Post-Training:** This is the canonical choice in LLM post-training (e.g., RLHF, DPO, GRPO). The reference is usually the original language model before any RL fine-tuning. Penalizing KL[ref || policy] ensures the fine-tuned model stays close to the original, preserving language quality and preventing over-optimization.
- **Effect:** Encourages the new policy to not deviate too much from the reference, maintaining fluency and generalization.

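A minimal sketch of this estimator with plain PyTorch tensors (illustrative shapes and values, not the actual training code):

```python
import torch

# Per-token log-probabilities of the same sampled tokens under the reference
# model and under the current (trainable) policy (illustrative values/shapes).
ref_log_prob = -torch.rand(4, 16)  # [batch, seq_len]
cur_log_prob = -torch.rand(4, 16)

# KL-to-reference penalty per token, in the numerically stable form
# exp(x) - 1 - x with x = ref_log_prob - cur_log_prob (non-negative for all x).
diff = ref_log_prob - cur_log_prob
kl_to_ref = diff.expm1() - diff

# Reduced to a scalar before being weighted into the loss or the reward.
kl_to_ref_penalty = kl_to_ref.mean()
```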
#### 2. KL to Inference Policy (KL[policy || inference])
- **Definition:** Measures how much the current policy diverges from the policy used to generate the data (the inference policy, sometimes called the behavior policy).
- **Implementation:** In GRPO, this is approximated as `prev_log_prob - cur_log_prob`, where `prev_log_prob` is from the inference policy that generated the data (see the sketch after this list).
- **Usage:**
  - **Canonical PPO:** In standard PPO (especially in RL for control), this is the canonical KL: KL[policy || inference]. The inference policy is the one that generated the trajectories in the replay buffer. Penalizing this KL ensures that the updated policy does not move too far from the data distribution, stabilizing importance sampling and learning.
- **Effect:** Prevents the policy from making large, unstable updates relative to the data it was trained on.

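A matching sketch for this approximation (same illustrative setup as above):

```python
import torch

# Log-probabilities of the sampled tokens under the inference (behavior) policy
# that generated them, as stored in the buffer, and under the current policy.
prev_log_prob = -torch.rand(4, 16)  # [batch, seq_len]
cur_log_prob = -torch.rand(4, 16)

# Per-token KL-to-inference term as described above, reduced to a scalar.
kl_to_inference = prev_log_prob - cur_log_prob
kl_to_inference_penalty = kl_to_inference.mean()
```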
| Setting | Canonical KL | Purpose |
|---|---|---|
| PPO (RL control) | KL[policy \|\| inference] | Stabilize updates, match data distribution |
| LLM Post-Training | KL[ref \|\| policy] | Stay close to pre-trained model |

In GRPO, both types of KL can be used and controlled via configuration. Typically, for LLM post-training, the KL to reference is the most important for preserving model quality, while the KL to inference is more about stabilizing the optimization process.

The KL contributions to the loss can be controlled via `train.kl_to_ref_coeff` and `train.kl_to_inference_coeff`, respectively.

Additionally, the KL-to-reference contribution can either be added to the reward during the grading of the LLM response, or added directly to the loss, as controlled by the `train.kl_coef_in_loss` config option.
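The sketch below shows how these options could fit together; the variable names are hypothetical stand-ins, not the scripts' actual code:

```python
import torch

# Hypothetical stand-ins for quantities computed elsewhere: a scalar policy
# objective, the per-token KL terms sketched earlier, and a per-sample task reward.
policy_loss = torch.tensor(0.5)
kl_to_ref = torch.rand(4, 16)
kl_to_inference = torch.randn(4, 16)
task_reward = torch.rand(4)

kl_to_ref_coeff = 1e-2        # train.kl_to_ref_coeff
kl_to_inference_coeff = 1e-2  # train.kl_to_inference_coeff
kl_coef_in_loss = True        # train.kl_coef_in_loss

if kl_coef_in_loss:
    # Canonical GRPO setting: KL[ref || policy] enters the loss directly and
    # acts as an explicit regularizer at every gradient step.
    loss = (
        policy_loss
        + kl_to_ref_coeff * kl_to_ref.mean()
        + kl_to_inference_coeff * kl_to_inference.mean()
    )
else:
    # Otherwise the KL-to-ref penalty is folded into the reward when the
    # response is graded, and only the KL-to-inference term stays in the loss.
    reward = task_reward - kl_to_ref_coeff * kl_to_ref.mean(dim=-1)
    loss = policy_loss + kl_to_inference_coeff * kl_to_inference.mean()
```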
In the original GRPO paper, the KL to reference (KL[ref || policy]) is added **directly to the loss function**, not to the reward. This means that the KL penalty acts as a regularizer during optimization, discouraging the policy from drifting too far from the reference model at every update step. This is in contrast to some RLHF-style approaches, where the KL penalty is added to the reward signal during data collection (i.e., the environment's reward is modified).

**Why does this matter?**
- **KL in the loss (as in GRPO):** The optimization explicitly balances the policy objective and the KL penalty at each gradient step, making the trade-off more direct and stable. This is the canonical approach in GRPO and is controlled by setting `train.kl_coef_in_loss=True` in the config.
- **KL in the reward:** The KL penalty is treated as part of the environment's reward, so the policy is trained to maximize this modified reward. This can sometimes make the effect of the KL less direct, as it is mixed with the task reward during data collection.

In summary, GRPO's approach of adding the KL to reference directly to the loss provides more explicit and stable regularization, and is the recommended setting for most LLM post-training scenarios.