
Out of Memory Error When Validation and Checkpointing Periods Are Misaligned #1137

@DWarez

Describe the bug

NeMo RL hits a CUDA out-of-memory error when val_period and checkpointing.save_period are set to different values, specifically when checkpointing occurs more frequently than validation (I did not try other combinations).

Steps/Code to reproduce bug

Working Configuration

grpo:
  val_period: 20
...
checkpointing:
  save_period: 20

This configuration runs without memory issues.

Failing Configuration

grpo:
  val_period: 20
...
checkpointing:
  save_period: 10

This configuration causes CUDA out of memory errors at step 21 (immediately after the second validation at step 20).
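
To make the misalignment concrete, here is a minimal Python sketch (my own illustration, not NeMo RL code) of which steps trigger validation and checkpointing, assuming both fire when step % period == 0, which matches the timeline below:

def schedule(val_period: int, save_period: int, max_step: int = 25):
    # Map each step to the events that fire on it.
    events = {}
    for step in range(1, max_step + 1):
        tags = []
        if step % val_period == 0:
            tags.append("validate")
        if step % save_period == 0:
            tags.append("checkpoint")
        if tags:
            events[step] = tags
    return events

# Working configuration: validation and checkpointing always coincide.
print(schedule(val_period=20, save_period=20))  # {20: ['validate', 'checkpoint']}

# Failing configuration: step 10 checkpoints alone, step 20 validates and then
# checkpoints back-to-back, and the OOM appears at step 21.
print(schedule(val_period=20, save_period=10))  # {10: ['checkpoint'], 20: ['validate', 'checkpoint']}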

Timeline Analysis
From the logs, the sequence of events is:

Step 10: Checkpointing occurs, memory is managed properly
Steps 11-19: Training continues normally
Step 20: Validation runs
Step 20: Another checkpoint is saved after validation
Step 21: Attempt to start next batch generation fails with OOM
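
If it helps with triage, below is an illustrative memory-logging helper (hypothetical, not an existing NeMo RL hook) that one could drop around the step-20 boundary on the policy workers to see which phase leaves GPU memory reserved before vLLM's wake_up at step 21:

import torch

def log_cuda_memory(tag: str, device: int = 0) -> None:
    # Print allocated/reserved CUDA memory in GiB for one device.
    allocated = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# Hypothetical placement inside the step loop (function names are made up):
# log_cuda_memory("before validation")
# run_validation()
# log_cuda_memory("after validation")
# save_checkpoint()
# log_cuda_memory("after checkpoint")
# torch.cuda.empty_cache()               # check whether releasing cached blocks
# log_cuda_memory("after empty_cache")   # avoids the step-21 wake_up OOM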

Here are highlights of the full log:

...

========================= Step 10/10078 =========================
▶ Preparing batch...
▶ Generating responses for batch of size 32...
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 09:53:11 [block_pool.py:321] Successfully reset prefix cache [repeated 7x across cluster]
(MegatronPolicyWorker[rank=8] pid=4950, ip=100.64.0.14) GPU Memory before optimizer offload: 28.74GB allocated, 28.76GB reserved [repeated 33x across cluster]
(RayWorkerWrapper pid=3032, ip=100.64.0.9) INFO 09-16 09:53:13 [gpu_worker.py:104] Sleep mode freed 124.54 GiB memory, 7.95 GiB memory is still in use. [repeated 31x across cluster]
(VllmGenerationWorker pid=2243, ip=100.64.0.9) INFO 09-16 09:53:13 [executor_base.py:187] It took 1.904843 seconds to fall asleep. [repeated 3x across cluster]
(MegatronPolicyWorker[rank=4] pid=18088) GPU Memory after optimizer offload: 4.43GB allocated, 4.53GB reserved [repeated 31x across cluster]
(MegatronPolicyWorker[rank=4] pid=18088) GPU Memory before optimizer offload: 30.77GB allocated, 30.79GB reserved [repeated 31x across cluster]
(MegatronPolicyWorker[rank=16] pid=5243, ip=100.64.0.9) GPU Memory after optimizer offload: 4.14GB allocated, 4.30GB reserved
(MegatronPolicyWorker[rank=22] pid=5536, ip=100.64.0.9) GPU Memory after optimizer offload: 4.14GB allocated, 4.21GB reserved
(MegatronPolicyWorker[rank=20] pid=5538, ip=100.64.0.9) GPU Memory after optimizer offload: 4.14GB allocated, 4.30GB reserved [repeated 30x across cluster]
(VllmGenerationWorker pid=14172) INFO 09-16 09:54:05 [executor_base.py:203] It took 0.724005 seconds to wake up tags ['weights'].
[Refit] Split 563 keys into 7 groups
Adding requests: 100%|██████████| 8/8 [00:00<00:00, 10436.84it/s]
Processed prompts:   0%|          | 0/8 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts:  12%|█▎        | 1/8 [00:07<00:53,  7.65s/it, est. speed input: 10.20 toks/s, output: 29.16 toks/s]
Adding requests: 100%|██████████| 8/8 [00:00<00:00, 10286.46it/s] [repeated 3x across cluster]
Processed prompts:   0%|          | 0/8 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s] [repeated 3x across cluster]
Processed prompts:  25%|██▌       | 2/8 [00:08<00:21,  3.64s/it, est. speed input: 18.39 toks/s, output: 55.42 toks/s]
Processed prompts:  38%|███▊      | 3/8 [00:10<00:14,  2.85s/it, est. speed input: 22.52 toks/s, output: 74.96 toks/s]
Processed prompts:  50%|█████     | 4/8 [00:10<00:07,  1.78s/it, est. speed input: 29.60 toks/s, output: 103.70 toks/s]
Processed prompts:  62%|██████▎   | 5/8 [00:11<00:04,  1.62s/it, est. speed input: 32.88 toks/s, output: 122.34 toks/s]
Processed prompts:  75%|███████▌  | 6/8 [00:13<00:03,  1.74s/it, est. speed input: 33.82 toks/s, output: 135.37 toks/s]
Processed prompts:  88%|████████▊ | 7/8 [00:14<00:01,  1.28s/it, est. speed input: 38.53 toks/s, output: 162.72 toks/s]
Processed prompts: 100%|██████████| 8/8 [00:16<00:00,  2.03s/it, est. speed input: 38.41 toks/s, output: 172.86 toks/s]
Processed prompts:  12%|█▎        | 1/8 [00:19<02:16, 19.52s/it, est. speed input: 7.58 toks/s, output: 32.58 toks/s] [repeated 2x across cluster]
Processed prompts:  12%|█▎        | 1/8 [00:27<03:15, 27.87s/it, est. speed input: 5.92 toks/s, output: 31.50 toks/s] [repeated 7x across cluster]
Processed prompts:  50%|█████     | 4/8 [00:32<00:20,  5.12s/it, est. speed input: 20.03 toks/s, output: 117.20 toks/s] [repeated 8x across cluster]
Processed prompts: 100%|██████████| 8/8 [00:34<00:00,  4.37s/it, est. speed input: 33.85 toks/s, output: 209.39 toks/s]
Processed prompts:  88%|████████▊ | 7/8 [00:40<00:03,  3.21s/it, est. speed input: 28.69 toks/s, output: 185.36 toks/s] [repeated 3x across cluster]
Processed prompts: 100%|██████████| 8/8 [00:45<00:00,  5.70s/it, est. speed input: 28.93 toks/s, output: 195.37 toks/s]
Processed prompts: 100%|██████████| 8/8 [00:57<00:00,  7.18s/it, est. speed input: 21.31 toks/s, output: 125.62 toks/s]
(MegatronPolicyWorker[rank=8] pid=4950, ip=100.64.0.14) GPU Memory before optimizer offload: 6.59GB allocated, 72.05GB reserved
(MegatronPolicyWorker[rank=24] pid=6280, ip=100.64.0.8) GPU Memory after optimizer offload: 6.78GB allocated, 22.29GB reserved [repeated 5x across cluster]
(MegatronPolicyWorker[rank=8] pid=4950, ip=100.64.0.14) GPU Memory after refit complete: 0.05GB allocated, 0.07GB reserved
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 09:54:05 [executor_base.py:203] It took 1.009648 seconds to wake up tags ['weights']. [repeated 3x across cluster]
(VllmGenerationWorker pid=2566, ip=100.64.0.8) INFO 09-16 09:54:10 [executor_base.py:203] It took 0.180229 seconds to wake up tags ['kv_cache'].
(VllmGenerationWorker pid=2566, ip=100.64.0.8) INFO 09-16 09:55:08 [block_pool.py:321] Successfully reset prefix cache
(MegatronPolicyWorker[rank=4] pid=18088) GPU Memory before optimizer offload: 6.59GB allocated, 72.65GB reserved [repeated 31x across cluster]
(MegatronPolicyWorker[rank=20] pid=5538, ip=100.64.0.9) GPU Memory after optimizer offload: 6.59GB allocated, 22.10GB reserved [repeated 27x across cluster]
(MegatronPolicyWorker[rank=20] pid=5538, ip=100.64.0.9) GPU Memory after refit complete: 0.05GB allocated, 0.21GB reserved [repeated 31x across cluster]
(VllmGenerationWorker pid=2243, ip=100.64.0.9) INFO 09-16 09:54:10 [executor_base.py:203] It took 0.280656 seconds to wake up tags ['kv_cache']. [repeated 3x across cluster]
(VllmGenerationWorker pid=2566, ip=100.64.0.8) INFO 09-16 09:55:08 [block_pool.py:321] Successfully reset prefix cache
(RayWorkerWrapper pid=3037, ip=100.64.0.9) INFO 09-16 09:55:09 [gpu_worker.py:104] Sleep mode freed 125.00 GiB memory, 7.36 GiB memory is still in use.
(VllmGenerationWorker pid=2243, ip=100.64.0.9) INFO 09-16 09:55:09 [executor_base.py:187] It took 1.715735 seconds to fall asleep.
▶ Processing rewards...
▶ Computing advantages...
▶ Preparing for logprob inference...
▶ Computing logprobs...
▶ Preparing for training...
▶ Training policy...
/opt/nemo-rl/nemo_rl/algorithms/grpo.py:823: UserWarning: You asked to save checkpoints based on val_reward but the metric is not found in the save state. Saving most recent k checkpoints instead.
  warnings.warn(
Saving checkpoint for step 10...
/opt/nemo-rl/nemo_rl/utils/checkpoint.py:198: UserWarning: Metric val_reward not found in checkpoint history. Keeping most recent k checkpoints.
  warnings.warn(
(MegatronPolicyWorker[rank=0] pid=17933) saving checkpoint at iteration       0 to /mnt/dsalvati/checkpoints/qwen2_72b-grpo-math/tmp_step_10/policy/weights in torch_dist format
(MegatronPolicyWorker[rank=0] pid=17933) Storing distributed optimizer sharded state of type fully_sharded_model_space
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 09:55:08 [block_pool.py:321] Successfully reset prefix cache [repeated 6x across cluster]
(MegatronPolicyWorker[rank=21] pid=5539, ip=100.64.0.9) GPU Memory before optimizer offload: 4.14GB allocated, 4.25GB reserved [repeated 32x across cluster]
(MegatronPolicyWorker[rank=20] pid=5538, ip=100.64.0.9) GPU Memory after optimizer offload: 4.14GB allocated, 4.30GB reserved [repeated 32x across cluster]
(RayWorkerWrapper pid=15030) INFO 09-16 09:55:10 [gpu_worker.py:104] Sleep mode freed 124.54 GiB memory, 7.78 GiB memory is still in use. [repeated 31x across cluster]
(VllmGenerationWorker pid=14172) INFO 09-16 09:55:10 [executor_base.py:187] It took 1.914869 seconds to fall asleep. [repeated 3x across cluster]
(MegatronPolicyWorker[rank=24] pid=6280, ip=100.64.0.8) Saved checkpoint to /mnt/dsalvati/checkpoints/qwen2_72b-grpo-math/tmp_step_10/policy/weights
(MegatronPolicyWorker[rank=0] pid=17933)   successfully saved checkpoint from iteration       0 to /mnt/dsalvati/checkpoints/qwen2_72b-grpo-math/tmp_step_10/policy/weights [ t 1/8, p 1/4 ]
Logged data to logs/exp_001/train_data_step9.jsonl

...
========================= Step 20/10078 =========================
📊 Validation Results:
    • Accuracy: 0.0339
    • Average response length: 886.5 tokens
    • Samples processed: 384

  ⏱️  Validation Timing:
    • Total validation time: 183.40s
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 10:18:30 [block_pool.py:321] Successfully reset prefix cache
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 10:15:23 [executor_base.py:203] It took 1.211350 seconds to wake up tags ['weights']. [repeated 3x across cluster]
(MegatronPolicyWorker[rank=21] pid=5539, ip=100.64.0.9) GPU Memory before optimizer offload: 6.60GB allocated, 72.21GB reserved [repeated 31x across cluster]
(MegatronPolicyWorker[rank=3] pid=17786) GPU Memory after refit complete: 0.05GB allocated, 0.15GB reserved [repeated 31x across cluster]
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 10:15:27 [executor_base.py:203] It took 0.185528 seconds to wake up tags ['kv_cache']. [repeated 3x across cluster]
(RayWorkerWrapper pid=3354, ip=100.64.0.8) INFO 09-16 10:18:32 [gpu_worker.py:104] Sleep mode freed 124.92 GiB memory, 8.16 GiB memory is still in use.
(VllmGenerationWorker pid=2566, ip=100.64.0.8) INFO 09-16 10:18:32 [executor_base.py:187] It took 1.493706 seconds to fall asleep.
Saving checkpoint for step 20...
(MegatronPolicyWorker[rank=0] pid=17933) saving checkpoint at iteration       0 to /mnt/dsalvati/checkpoints/qwen2_72b-grpo-math/tmp_step_20/policy/weights in torch_dist format
(MegatronPolicyWorker[rank=0] pid=17933) Storing distributed optimizer sharded state of type fully_sharded_model_space
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 10:18:30 [block_pool.py:321] Successfully reset prefix cache [repeated 7x across cluster]
(RayWorkerWrapper pid=15033) INFO 09-16 10:18:32 [gpu_worker.py:104] Sleep mode freed 125.39 GiB memory, 7.38 GiB memory is still in use. [repeated 31x across cluster]
(VllmGenerationWorker pid=14172) INFO 09-16 10:18:32 [executor_base.py:187] It took 1.726623 seconds to fall asleep. [repeated 3x across cluster]
(MegatronPolicyWorker[rank=8] pid=4950, ip=100.64.0.14) Saved checkpoint to /mnt/dsalvati/checkpoints/qwen2_72b-grpo-math/tmp_step_20/policy/weights

...

========================= Step 21/10078 =========================
▶ Preparing batch...
▶ Generating responses for batch of size 32...
2025-09-16 10:20:03,241	ERROR worker.py:421 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::VllmGenerationWorker.wake_up() (pid=1101, ip=100.64.0.14, actor_id=bef3b99dbaaf21eb44917ae501000000, repr=VllmGenerationWorker)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/nemo_rl/models/generation/vllm/vllm_worker.py", line 832, in wake_up
    self.llm.wake_up(**wake_up_args)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 1524, in wake_up
    self.llm_engine.wake_up(tags)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py", line 281, in wake_up
    self.engine_core.wake_up(tags)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 264, in wake_up
    self.engine_core.wake_up(tags)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 353, in wake_up
    self.model_executor.wake_up(tags)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 201, in wake_up
    self.collective_rpc("wake_up", kwargs=dict(tags=tags))
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 308, in collective_rpc
    return self._run_workers(method, *args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 503, in _run_workers
    ray_worker_outputs = ray.get(ray_worker_outputs)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerWrapper.execute_method() (pid=1964, ip=100.64.0.14, actor_id=1a6384abfebbb0059c7e39ad01000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7f80905f01a0>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 620, in execute_method
    raise e
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 611, in execute_method
    return run_method(self, method, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2985, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in wake_up
    allocator.wake_up(tags)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/device_allocator/cumem.py", line 225, in wake_up
    create_and_map(handle)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/device_allocator/cumem.py", line 78, in create_and_map
    python_create_and_map(*allocation_handle)
RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
(RayWorkerWrapper pid=1965, ip=100.64.0.14) CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62

Environment overview (please complete the following information)

Environment location: Lepton, running a NeMo RL image built as described here.
