
Cannot train the Qwen3 model on the tau-bench datasets #270

@aongwachi1

Description

I'm training the Qwen3 model on the tau-bench datasets, following the example source code in the dev folder. Unfortunately, I got the error below:

import os
from dotenv import load_dotenv

load_dotenv()

# Required
OPENAI_API_KEY = "XXXX"
if OPENAI_API_KEY:
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

OPENPIPE_API_KEY = "XXX"
if OPENPIPE_API_KEY:
    os.environ["OPENPIPE_API_KEY"] = OPENPIPE_API_KEY

# Optional
WANDB_API_KEY = "XXX"
if WANDB_API_KEY:
    os.environ["WANDB_API_KEY"] = WANDB_API_KEY

if not os.environ.get("WANDB_API_KEY"):
    print("WANDB_API_KEY is not set. We'll skip logging metrics to Weights & Biases.")
import art
from dotenv import load_dotenv
from tau_bench.types import TauBenchPolicyConfig, TauBenchTrainingConfig
from run_rl import train
from run import RunConfig
import torch

load_dotenv()

MODEL_NAME = "tau-bench-retail-agent-005"
model = art.TrainableModel(
    name=MODEL_NAME,
    project="amity-agentic",
    base_model="Qwen/Qwen3-8B",
    config=TauBenchPolicyConfig(
        training_config=TauBenchTrainingConfig(
            trajectories_per_group=4,
            groups_per_step=4,
            learning_rate=1.2e-5,
            eval_steps=10,
            val_set_size=10,
            training_dataset_size=30,
            num_epochs=50,
            train_mode="sync_rl",
        ),
        run_config=RunConfig(
            env="airline",
            model_provider="hosted_vllm",
            user_model_provider="openai",
            model=MODEL_NAME,
            user_model="gpt-4o",
            user_strategy="llm",
            agent_strategy="tool-calling-rl",
            temperature=1.0,
            task_split="test",
            log_dir="rl_results",
            skip_eval=False,
        ),
    ),
    # tensor_parallel_size could instead be set to torch.cuda.device_count()
    _internal_config=art.dev.InternalModelConfig(
        engine_args=art.dev.EngineArgs(
            tensor_parallel_size=1, gpu_memory_utilization=0.85
        ),
        torchtune_args=art.dev.TorchtuneArgs(
            model="qwen3_8b_instruct", model_type="QWEN3", async_weight_syncing=True
        ),
    ),
)
await train(model)
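The top-level `await train(model)` (and the W&B "Failed to detect the name of this notebook" warning below) suggests this is run from a Jupyter notebook cell. For anyone reproducing it as a plain script instead, a minimal sketch, assuming the same async `train` entry point from `run_rl`, would be:

import asyncio

# Hypothetical script-style equivalent of the notebook cell above;
# asyncio.run drives the async train() call outside a notebook event loop.
if __name__ == "__main__":
    asyncio.run(train(model))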

RUNTIME ERROR

wandb: ERROR Failed to detect the name of this notebook. You can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: wachiravit (amity-lab) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
Tracking run with wandb version 0.20.1
Run data is saved locally in /workspace/Multi-Turn-RL-Agent/tau_bench/ART/dev/tau-bench/wandb/run-20250724_110455-tau-bench-retail-agent-005
Resuming run [tau-bench-retail-agent-005](https://wandb.ai/amity-lab/amity-agentic/runs/tau-bench-retail-agent-005) to [Weights & Biases](https://wandb.ai/amity-lab/amity-agentic) ([docs](https://wandb.me/developer-guide))
View project at https://wandb.ai/amity-lab/amity-agentic
View run at https://wandb.ai/amity-lab/amity-agentic/runs/tau-bench-retail-agent-005
INFO 07-24 11:05:01 [__init__.py:244] Automatically detected platform cuda.
WARNING 07-24 11:05:07 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 07-24 11:05:08 [__init__.py:244] Automatically detected platform cuda.
/root/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots/9c925d64d72725edaf899c6cb9c377fd0709d9c5
INFO 07-24 11:05:18 [config.py:823] This model supports multiple tasks: {'generate', 'reward', 'classify', 'score', 'embed'}. Defaulting to 'generate'.
INFO 07-24 11:05:18 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 07-24 11:05:19 [utils.py:2597] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
WARNING 07-24 11:05:20 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 07-24 11:05:22 [__init__.py:244] Automatically detected platform cuda.
INFO 07-24 11:05:24 [core.py:455] Waiting for init message from front-end.
INFO 07-24 11:05:24 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='Qwen/Qwen3-8B', speculative_config=None, tokenizer='Qwen/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
INFO 07-24 11:05:24 [worker_base.py:590] Injected <class 'art.vllm.engine.WorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['run', 'time']
WARNING 07-24 11:05:24 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f4d72d1a710>
INFO 07-24 11:05:25 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 07-24 11:05:25 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 07-24 11:05:25 [gpu_model_runner.py:1595] Starting to load model Qwen/Qwen3-8B...
INFO 07-24 11:05:25 [gpu_model_runner.py:1600] Loading model from scratch...
INFO 07-24 11:05:25 [cuda.py:252] Using Flash Attention backend on V1 engine.
INFO 07-24 11:05:26 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:03,  1.18it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.18it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.20it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.34it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.78it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.49it/s]

INFO 07-24 11:05:30 [default_loader.py:272] Loading weights took 3.38 seconds
INFO 07-24 11:05:30 [gpu_model_runner.py:1624] Model loading took 15.2683 GiB and 4.408849 seconds
INFO 07-24 11:05:36 [backends.py:462] Using cache directory: /root/.cache/vllm/torch_compile_cache/7a29998acf/rank_0_0 for vLLM's torch.compile
INFO 07-24 11:05:36 [backends.py:472] Dynamo bytecode transform time: 6.37 s
INFO 07-24 11:05:42 [backends.py:135] Directly load the compiled graph(s) for shape None from the cache, took 4.985 s
INFO 07-24 11:05:42 [monitor.py:34] torch.compile takes 6.37 s in total
INFO 07-24 11:05:43 [gpu_worker.py:227] Available KV cache memory: 97.78 GiB
INFO 07-24 11:05:43 [kv_cache_utils.py:715] GPU KV cache size: 712,032 tokens
INFO 07-24 11:05:43 [kv_cache_utils.py:719] Maximum concurrency for 40,960 tokens per request: 17.38x
INFO 07-24 11:05:58 [gpu_model_runner.py:2048] Graph capturing finished in 15 secs, took 0.75 GiB
INFO 07-24 11:05:58 [core.py:171] init engine (profile, create kv cache, warmup model) took 28.58 seconds
INFO 07-24 11:05:59 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 44502
Loading training tasks...
Training on 30 tasks
Validation on 10 tasks
Iterating dataset:   0%
 0/400 [00:00<?, ?batch/s]

--- Training Step 0 (Epoch 0, Step 0) ---

--- Evaluating at Step 0 ---
Evaluating model on 10 tasks...
gather: 100%
 10/10 [00:44<00:00,  6.11s/it, reward=0.1, total_steps=13.8, final_prompt_tokens=7034.0, avg_completion_tokens=74.9, max_completion_tokens=314, outcome_correct=0.2, forced_stop=0.1, duration=21.3, completion_tokens=74.9]
Eval task 30: reward=0.0
Eval task 31: reward=0.0
Eval task 32: reward=0.0
Eval task 33: reward=-1
Eval task 34: reward=0.0
Eval task 35: reward=0.0
Eval task 36: reward=1.0
Eval task 37: reward=1.0
Eval task 38: reward=0.0
Eval task 39: reward=0.0
Average evaluation reward: 0.1
Generating trajectories for 4 tasks...
gather: 100%
 16/16 [01:09<00:00,  6.31s/it, reward=-0.125, total_steps=17.8, final_prompt_tokens=7.89e+3, avg_completion_tokens=92.1, max_completion_tokens=390, outcome_correct=0.0625, forced_stop=0.188, duration=33.1, completion_tokens=92.1]
wandb: WARNING Tried to log to step 0 that is less than the current step 1. Steps must be monotonically increasing, so this data will be ignored. See https://wandb.me/define-metric to log data out of order.
wandb: WARNING Tried to log to step 0 that is less than the current step 1. Steps must be monotonically increasing, so this data will be ignored. See https://wandb.me/define-metric to log data out of order.
Training on 4 trajectory groups...
Packed 12 trajectories into 8 sequences of length 16384
train:   0%
 0/8 [00:00<?, ?it/s]
ERROR 07-24 11:08:21 [core.py:583] Invocation of collective_rpc method failed
ERROR 07-24 11:08:21 [core.py:583] Traceback (most recent call last):
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/dev/tau-bench/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 580, in _handle_client_request
ERROR 07-24 11:08:21 [core.py:583]     output.result = method(
ERROR 07-24 11:08:21 [core.py:583]                     ^^^^^^^
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/dev/tau-bench/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 347, in collective_rpc
ERROR 07-24 11:08:21 [core.py:583]     return self.model_executor.collective_rpc(method, timeout, args,
ERROR 07-24 11:08:21 [core.py:583]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/dev/tau-bench/.venv/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-24 11:08:21 [core.py:583]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-24 11:08:21 [core.py:583]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/dev/tau-bench/.venv/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
ERROR 07-24 11:08:21 [core.py:583]     return func(*args, **kwargs)
ERROR 07-24 11:08:21 [core.py:583]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/src/art/vllm/engine.py", line 131, in run
ERROR 07-24 11:08:21 [core.py:583]     return func(*args, **kwargs)
ERROR 07-24 11:08:21 [core.py:583]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/src/art/torchtune/service.py", line 291, in sleep
ERROR 07-24 11:08:21 [core.py:583]     worker.sleep(level)
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/dev/tau-bench/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 93, in sleep
ERROR 07-24 11:08:21 [core.py:583]     allocator.sleep(offload_tags=("weights", ) if level == 1 else tuple())
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/src/art/vllm/patches.py", line 54, in sleep
ERROR 07-24 11:08:21 [core.py:583]     cpu_backup_tensor = torch.empty(
ERROR 07-24 11:08:21 [core.py:583]                         ^^^^^^^^^^^^
ERROR 07-24 11:08:21 [core.py:583] RuntimeError: CUDA error: invalid argument
ERROR 07-24 11:08:21 [core.py:583] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 07-24 11:08:21 [core.py:583] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 07-24 11:08:21 [core.py:583] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 07-24 11:08:21 [core.py:583]
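The failure surfaces in ART's patched allocator sleep (src/art/vllm/patches.py) when torch.empty allocates the CPU backup tensor, and as the log notes, CUDA errors may be reported asynchronously, so the stack trace may not point at the real failing call. A minimal debugging sketch following the suggestion in the error message itself (this only affects error reporting; it is not a fix):

import os

# Force synchronous kernel launches so the CUDA error is raised at the
# call that actually failed. Must be set before CUDA is initialized,
# e.g. at the very top of the notebook or in the shell environment.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"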
