
Out of Memory Error When Validation and Checkpointing Periods Are Misaligned #1137

@DWarez

Describe the bug

NeMo RL hits a CUDA out-of-memory error when val_period and checkpointing.save_period are set to different values, specifically when checkpointing occurs more frequently than validation (I did not try other combinations).

Steps/Code to reproduce bug

Working Configuration

grpo:
  val_period: 20
...
checkpointing:
  save_period: 20

This configuration runs without memory issues.

Failing Configuration

grpo:
  val_period: 20
...
checkpointing:
  save_period: 10

This configuration causes CUDA out of memory errors at step 21 (immediately after the second validation at step 20).
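
To make the misalignment concrete, here is a minimal Python sketch (my own illustration, not NeMo RL code) of which steps trigger validation and checkpointing, assuming both fire when step % period == 0, which matches the timeline below:

def schedule(val_period: int, save_period: int, max_step: int = 25):
    # Map each step to the events that fire on it.
    events = {}
    for step in range(1, max_step + 1):
        tags = []
        if step % val_period == 0:
            tags.append("validate")
        if step % save_period == 0:
            tags.append("checkpoint")
        if tags:
            events[step] = tags
    return events

# Working configuration: validation and checkpointing always coincide.
print(schedule(val_period=20, save_period=20))  # {20: ['validate', 'checkpoint']}

# Failing configuration: step 10 checkpoints alone, step 20 validates and then
# checkpoints back-to-back, and the OOM appears at step 21.
print(schedule(val_period=20, save_period=10))  # {10: ['checkpoint'], 20: ['validate', 'checkpoint']}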

Timeline Analysis
From the logs, the sequence of events is:

Step 10: Checkpointing occurs, memory is managed properly
Steps 11-19: Training continues normally
Step 20: Validation runs
Step 20: Another checkpoint is saved after validation
Step 21: Attempt to start next batch generation fails with OOM
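
If it helps with triage, below is an illustrative memory-logging helper (hypothetical, not an existing NeMo RL hook) that one could drop around the step-20 boundary on the policy workers to see which phase leaves GPU memory reserved before vLLM's wake_up at step 21:

import torch

def log_cuda_memory(tag: str, device: int = 0) -> None:
    # Print allocated/reserved CUDA memory in GiB for one device.
    allocated = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# Hypothetical placement inside the step loop (function names are made up):
# log_cuda_memory("before validation")
# run_validation()
# log_cuda_memory("after validation")
# save_checkpoint()
# log_cuda_memory("after checkpoint")
# torch.cuda.empty_cache()               # check whether releasing cached blocks
# log_cuda_memory("after empty_cache")   # avoids the step-21 wake_up OOM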

Here are highlights of the full log:

...

========================= Step 10/10078 =========================
▶ Preparing batch...
▶ Generating responses for batch of size 32...
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 09:53:11 [block_pool.py:321] Successfully reset prefix cache [repeated 7x across cluster]
(MegatronPolicyWorker[rank=8] pid=4950, ip=100.64.0.14) GPU Memory before optimizer offload: 28.74GB allocated, 28.76GB reserved [repeated 33x across cluster]
(RayWorkerWrapper pid=3032, ip=100.64.0.9) INFO 09-16 09:53:13 [gpu_worker.py:104] Sleep mode freed 124.54 GiB memory, 7.95 GiB memory is still in use. [repeated 31x across cluster]
(VllmGenerationWorker pid=2243, ip=100.64.0.9) INFO 09-16 09:53:13 [executor_base.py:187] It took 1.904843 seconds to fall asleep. [repeated 3x across cluster]
(MegatronPolicyWorker[rank=4] pid=18088) GPU Memory after optimizer offload: 4.43GB allocated, 4.53GB reserved [repeated 31x across cluster]
(MegatronPolicyWorker[rank=4] pid=18088) GPU Memory before optimizer offload: 30.77GB allocated, 30.79GB reserved [repeated 31x across cluster]
(MegatronPolicyWorker[rank=16] pid=5243, ip=100.64.0.9) GPU Memory after optimizer offload: 4.14GB allocated, 4.30GB reserved
(MegatronPolicyWorker[rank=22] pid=5536, ip=100.64.0.9) GPU Memory after optimizer offload: 4.14GB allocated, 4.21GB reserved
(MegatronPolicyWorker[rank=20] pid=5538, ip=100.64.0.9) GPU Memory after optimizer offload: 4.14GB allocated, 4.30GB reserved [repeated 30x across cluster]
(VllmGenerationWorker pid=14172) INFO 09-16 09:54:05 [executor_base.py:203] It took 0.724005 seconds to wake up tags ['weights'].
[Refit] Split 563 keys into 7 groups
Adding requests: 100%|██████████| 8/8 [00:00<00:00, 10436.84it/s]
Processed prompts:   0%|          | 0/8 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts:  12%|█▎        | 1/8 [00:07<00:53,  7.65s/it, est. speed input: 10.20 toks/s, output: 29.16 toks/s]
Adding requests: 100%|██████████| 8/8 [00:00<00:00, 10286.46it/s] [repeated 3x across cluster]
Processed prompts:   0%|          | 0/8 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s] [repeated 3x across cluster]
Processed prompts:  25%|██▌       | 2/8 [00:08<00:21,  3.64s/it, est. speed input: 18.39 toks/s, output: 55.42 toks/s]
Processed prompts:  38%|███▊      | 3/8 [00:10<00:14,  2.85s/it, est. speed input: 22.52 toks/s, output: 74.96 toks/s]
Processed prompts:  50%|█████     | 4/8 [00:10<00:07,  1.78s/it, est. speed input: 29.60 toks/s, output: 103.70 toks/s]
Processed prompts:  62%|██████▎   | 5/8 [00:11<00:04,  1.62s/it, est. speed input: 32.88 toks/s, output: 122.34 toks/s]
Processed prompts:  75%|███████▌  | 6/8 [00:13<00:03,  1.74s/it, est. speed input: 33.82 toks/s, output: 135.37 toks/s]
Processed prompts:  88%|████████▊ | 7/8 [00:14<00:01,  1.28s/it, est. speed input: 38.53 toks/s, output: 162.72 toks/s]
Processed prompts: 100%|██████████| 8/8 [00:16<00:00,  2.03s/it, est. speed input: 38.41 toks/s, output: 172.86 toks/s]
Processed prompts:  12%|█▎        | 1/8 [00:19<02:16, 19.52s/it, est. speed input: 7.58 toks/s, output: 32.58 toks/s] [repeated 2x across cluster]
Processed prompts:  12%|█▎        | 1/8 [00:27<03:15, 27.87s/it, est. speed input: 5.92 toks/s, output: 31.50 toks/s] [repeated 7x across cluster]
Processed prompts:  50%|█████     | 4/8 [00:32<00:20,  5.12s/it, est. speed input: 20.03 toks/s, output: 117.20 toks/s] [repeated 8x across cluster]
Processed prompts: 100%|██████████| 8/8 [00:34<00:00,  4.37s/it, est. speed input: 33.85 toks/s, output: 209.39 toks/s]
Processed prompts:  88%|████████▊ | 7/8 [00:40<00:03,  3.21s/it, est. speed input: 28.69 toks/s, output: 185.36 toks/s] [repeated 3x across cluster]
Processed prompts: 100%|██████████| 8/8 [00:45<00:00,  5.70s/it, est. speed input: 28.93 toks/s, output: 195.37 toks/s]
Processed prompts: 100%|██████████| 8/8 [00:57<00:00,  7.18s/it, est. speed input: 21.31 toks/s, output: 125.62 toks/s]
(MegatronPolicyWorker[rank=8] pid=4950, ip=100.64.0.14) GPU Memory before optimizer offload: 6.59GB allocated, 72.05GB reserved
(MegatronPolicyWorker[rank=24] pid=6280, ip=100.64.0.8) GPU Memory after optimizer offload: 6.78GB allocated, 22.29GB reserved [repeated 5x across cluster]
(MegatronPolicyWorker[rank=8] pid=4950, ip=100.64.0.14) GPU Memory after refit complete: 0.05GB allocated, 0.07GB reserved
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 09:54:05 [executor_base.py:203] It took 1.009648 seconds to wake up tags ['weights']. [repeated 3x across cluster]
(VllmGenerationWorker pid=2566, ip=100.64.0.8) INFO 09-16 09:54:10 [executor_base.py:203] It took 0.180229 seconds to wake up tags ['kv_cache'].
(VllmGenerationWorker pid=2566, ip=100.64.0.8) INFO 09-16 09:55:08 [block_pool.py:321] Successfully reset prefix cache
(MegatronPolicyWorker[rank=4] pid=18088) GPU Memory before optimizer offload: 6.59GB allocated, 72.65GB reserved [repeated 31x across cluster]
(MegatronPolicyWorker[rank=20] pid=5538, ip=100.64.0.9) GPU Memory after optimizer offload: 6.59GB allocated, 22.10GB reserved [repeated 27x across cluster]
(MegatronPolicyWorker[rank=20] pid=5538, ip=100.64.0.9) GPU Memory after refit complete: 0.05GB allocated, 0.21GB reserved [repeated 31x across cluster]
(VllmGenerationWorker pid=2243, ip=100.64.0.9) INFO 09-16 09:54:10 [executor_base.py:203] It took 0.280656 seconds to wake up tags ['kv_cache']. [repeated 3x across cluster]
(VllmGenerationWorker pid=2566, ip=100.64.0.8) INFO 09-16 09:55:08 [block_pool.py:321] Successfully reset prefix cache
(RayWorkerWrapper pid=3037, ip=100.64.0.9) INFO 09-16 09:55:09 [gpu_worker.py:104] Sleep mode freed 125.00 GiB memory, 7.36 GiB memory is still in use.
(VllmGenerationWorker pid=2243, ip=100.64.0.9) INFO 09-16 09:55:09 [executor_base.py:187] It took 1.715735 seconds to fall asleep.
▶ Processing rewards...
▶ Computing advantages...
▶ Preparing for logprob inference...
▶ Computing logprobs...
▶ Preparing for training...
▶ Training policy...
/opt/nemo-rl/nemo_rl/algorithms/grpo.py:823: UserWarning: You asked to save checkpoints based on val_reward but the metric is not found in the save state. Saving most recent k checkpoints instead.
  warnings.warn(
Saving checkpoint for step 10...
/opt/nemo-rl/nemo_rl/utils/checkpoint.py:198: UserWarning: Metric val_reward not found in checkpoint history. Keeping most recent k checkpoints.
  warnings.warn(
(MegatronPolicyWorker[rank=0] pid=17933) saving checkpoint at iteration       0 to /mnt/dsalvati/checkpoints/qwen2_72b-grpo-math/tmp_step_10/policy/weights in torch_dist format
(MegatronPolicyWorker[rank=0] pid=17933) Storing distributed optimizer sharded state of type fully_sharded_model_space
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 09:55:08 [block_pool.py:321] Successfully reset prefix cache [repeated 6x across cluster]
(MegatronPolicyWorker[rank=21] pid=5539, ip=100.64.0.9) GPU Memory before optimizer offload: 4.14GB allocated, 4.25GB reserved [repeated 32x across cluster]
(MegatronPolicyWorker[rank=20] pid=5538, ip=100.64.0.9) GPU Memory after optimizer offload: 4.14GB allocated, 4.30GB reserved [repeated 32x across cluster]
(RayWorkerWrapper pid=15030) INFO 09-16 09:55:10 [gpu_worker.py:104] Sleep mode freed 124.54 GiB memory, 7.78 GiB memory is still in use. [repeated 31x across cluster]
(VllmGenerationWorker pid=14172) INFO 09-16 09:55:10 [executor_base.py:187] It took 1.914869 seconds to fall asleep. [repeated 3x across cluster]
(MegatronPolicyWorker[rank=24] pid=6280, ip=100.64.0.8) Saved checkpoint to /mnt/dsalvati/checkpoints/qwen2_72b-grpo-math/tmp_step_10/policy/weights
(MegatronPolicyWorker[rank=0] pid=17933)   successfully saved checkpoint from iteration       0 to /mnt/dsalvati/checkpoints/qwen2_72b-grpo-math/tmp_step_10/policy/weights [ t 1/8, p 1/4 ]
Logged data to logs/exp_001/train_data_step9.jsonl

...
========================= Step 20/10078 =========================
📊 Validation Results:
    • Accuracy: 0.0339
    • Average response length: 886.5 tokens
    • Samples processed: 384

  ⏱️  Validation Timing:
    • Total validation time: 183.40s
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 10:18:30 [block_pool.py:321] Successfully reset prefix cache
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 10:15:23 [executor_base.py:203] It took 1.211350 seconds to wake up tags ['weights']. [repeated 3x across cluster]
(MegatronPolicyWorker[rank=21] pid=5539, ip=100.64.0.9) GPU Memory before optimizer offload: 6.60GB allocated, 72.21GB reserved [repeated 31x across cluster]
(MegatronPolicyWorker[rank=3] pid=17786) GPU Memory after refit complete: 0.05GB allocated, 0.15GB reserved [repeated 31x across cluster]
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 10:15:27 [executor_base.py:203] It took 0.185528 seconds to wake up tags ['kv_cache']. [repeated 3x across cluster]
(RayWorkerWrapper pid=3354, ip=100.64.0.8) INFO 09-16 10:18:32 [gpu_worker.py:104] Sleep mode freed 124.92 GiB memory, 8.16 GiB memory is still in use.
(VllmGenerationWorker pid=2566, ip=100.64.0.8) INFO 09-16 10:18:32 [executor_base.py:187] It took 1.493706 seconds to fall asleep.
Saving checkpoint for step 20...
(MegatronPolicyWorker[rank=0] pid=17933) saving checkpoint at iteration       0 to /mnt/dsalvati/checkpoints/qwen2_72b-grpo-math/tmp_step_20/policy/weights in torch_dist format
(MegatronPolicyWorker[rank=0] pid=17933) Storing distributed optimizer sharded state of type fully_sharded_model_space
(VllmGenerationWorker pid=1101, ip=100.64.0.14) INFO 09-16 10:18:30 [block_pool.py:321] Successfully reset prefix cache [repeated 7x across cluster]
(RayWorkerWrapper pid=15033) INFO 09-16 10:18:32 [gpu_worker.py:104] Sleep mode freed 125.39 GiB memory, 7.38 GiB memory is still in use. [repeated 31x across cluster]
(VllmGenerationWorker pid=14172) INFO 09-16 10:18:32 [executor_base.py:187] It took 1.726623 seconds to fall asleep. [repeated 3x across cluster]
(MegatronPolicyWorker[rank=8] pid=4950, ip=100.64.0.14) Saved checkpoint to /mnt/dsalvati/checkpoints/qwen2_72b-grpo-math/tmp_step_20/policy/weights

...

========================= Step 21/10078 =========================
▶ Preparing batch...
▶ Generating responses for batch of size 32...
2025-09-16 10:20:03,241	ERROR worker.py:421 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::VllmGenerationWorker.wake_up() (pid=1101, ip=100.64.0.14, actor_id=bef3b99dbaaf21eb44917ae501000000, repr=VllmGenerationWorker)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/nemo_rl/models/generation/vllm/vllm_worker.py", line 832, in wake_up
    self.llm.wake_up(**wake_up_args)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 1524, in wake_up
    self.llm_engine.wake_up(tags)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py", line 281, in wake_up
    self.engine_core.wake_up(tags)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 264, in wake_up
    self.engine_core.wake_up(tags)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 353, in wake_up
    self.model_executor.wake_up(tags)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 201, in wake_up
    self.collective_rpc("wake_up", kwargs=dict(tags=tags))
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 308, in collective_rpc
    return self._run_workers(method, *args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 503, in _run_workers
    ray_worker_outputs = ray.get(ray_worker_outputs)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerWrapper.execute_method() (pid=1964, ip=100.64.0.14, actor_id=1a6384abfebbb0059c7e39ad01000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7f80905f01a0>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 620, in execute_method
    raise e
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 611, in execute_method
    return run_method(self, method, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2985, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in wake_up
    allocator.wake_up(tags)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/device_allocator/cumem.py", line 225, in wake_up
    create_and_map(handle)
  File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker/lib/python3.12/site-packages/vllm/device_allocator/cumem.py", line 78, in create_and_map
    python_create_and_map(*allocation_handle)
RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
(RayWorkerWrapper pid=1965, ip=100.64.0.14) CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62

Environment overview (please complete the following information)

Environment location: Lepton, running a NeMo RL image built as described here.
