If you have 2 GPUs, make sure only one is used for training and the other is used for inference.

So: CUDA_VISIBLE_DEVICES=0 vf-vllm .... for the inference server, and CUDA_VISIBLE_DEVICES=1 accelerate launch ... for the trainer (sketch below).
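A minimal sketch of that two-terminal setup; the model name, the --model flag, and the training script name (train_grpo.py) are placeholders/assumptions, not from this thread:

```bash
# Terminal 1: vLLM inference server pinned to GPU 0
CUDA_VISIBLE_DEVICES=0 vf-vllm --model Qwen/Qwen2.5-1.5B-Instruct

# Terminal 2: trainer pinned to GPU 1, single process
CUDA_VISIBLE_DEVICES=1 accelerate launch --num_processes 1 train_grpo.py
```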

vLLM will pre-allocate nearly all (up to 90% by default) of the memory on every GPU it can see, which leaves no room for the trainer on the same GPUs.
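The knob behind that ~90% figure is vLLM's gpu_memory_utilization setting. Whether vf-vllm forwards it as a CLI flag is an assumption here, so check its --help first; a sketch of lowering the cap:

```bash
# Lower vLLM's pre-allocation cap to leave headroom on the GPU it runs on.
# Assumption: vf-vllm passes standard vLLM engine args through unchanged.
CUDA_VISIBLE_DEVICES=0 vf-vllm --model Qwen/Qwen2.5-1.5B-Instruct --gpu-memory-utilization 0.6
```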

This should work with a 1.5B model if you tune your batch size / context length.
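While tuning those settings, it helps to watch memory on both GPUs: the vLLM server should sit near its cap, while the trainer needs headroom for activation and optimizer spikes.

```bash
# Refresh nvidia-smi every second to watch memory usage on both GPUs while tuning
watch -n 1 nvidia-smi
```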
