
Cannot train the Qwen3 model on the tau-bench datasets #270

@aongwachi1

Description

I'm training the Qwen3 model on the tau-bench datasets, following the example source code in the dev folder. Unfortunately, I got the error below:

import os
from dotenv import load_dotenv

load_dotenv()

# Required
OPENAI_API_KEY = "XXXX"
if OPENAI_API_KEY:
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

OPENPIPE_API_KEY = "XXX"
if OPENPIPE_API_KEY:
    os.environ["OPENPIPE_API_KEY"] = OPENPIPE_API_KEY

# Optional
WANDB_API_KEY = "XXX"
if WANDB_API_KEY:
    os.environ["WANDB_API_KEY"] = WANDB_API_KEY

if not os.environ.get("WANDB_API_KEY"):
    print("WANDB_API_KEY is not set. We'll skip logging metrics to Weights & Biases.")
import art
from dotenv import load_dotenv
from tau_bench.types import TauBenchPolicyConfig, TauBenchTrainingConfig
from run_rl import train
from run import RunConfig
import torch

load_dotenv()

MODEL_NAME = "tau-bench-retail-agent-005"
model = art.TrainableModel(
    name=MODEL_NAME,
    project="amity-agentic",
    base_model="Qwen/Qwen3-8B",
    config=TauBenchPolicyConfig(
        training_config=TauBenchTrainingConfig(
            trajectories_per_group=4,
            groups_per_step=4,
            learning_rate=1.2e-5,
            eval_steps=10,
            val_set_size=10,
            training_dataset_size=30,
            num_epochs=50,
            train_mode="sync_rl",
        ),
        run_config=RunConfig(
            env="airline",
            model_provider="hosted_vllm",
            user_model_provider="openai",
            model=MODEL_NAME,
            user_model="gpt-4o",
            user_strategy="llm",
            agent_strategy="tool-calling-rl",
            temperature=1.0,
            task_split="test",
            log_dir="rl_results",
            skip_eval=False,
        ),
    ),
    # tensor_parallel_size could instead be set to torch.cuda.device_count()
    _internal_config=art.dev.InternalModelConfig(
        engine_args=art.dev.EngineArgs(
            tensor_parallel_size=1, gpu_memory_utilization=0.85
        ),
        torchtune_args=art.dev.TorchtuneArgs(
            model="qwen3_8b_instruct", model_type="QWEN3", async_weight_syncing=True
        ),
    ),
)
await train(model)
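The top-level `await train(model)` (and the W&B "Failed to detect the name of this notebook" warning below) suggests this is run from a Jupyter notebook cell. For anyone reproducing it as a plain script instead, a minimal sketch, assuming the same async `train` entry point from `run_rl`, would be:

import asyncio

# Hypothetical script-style equivalent of the notebook cell above;
# asyncio.run drives the async train() call outside a notebook event loop.
if __name__ == "__main__":
    asyncio.run(train(model))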

RUNTIME ERROR

wandb: ERROR Failed to detect the name of this notebook. You can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: wachiravit (amity-lab) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
Tracking run with wandb version 0.20.1
Run data is saved locally in /workspace/Multi-Turn-RL-Agent/tau_bench/ART/dev/tau-bench/wandb/run-20250724_110455-tau-bench-retail-agent-005
Resuming run [tau-bench-retail-agent-005](https://wandb.ai/amity-lab/amity-agentic/runs/tau-bench-retail-agent-005) to [Weights & Biases](https://wandb.ai/amity-lab/amity-agentic) ([docs](https://wandb.me/developer-guide))
View project at https://wandb.ai/amity-lab/amity-agentic
View run at https://wandb.ai/amity-lab/amity-agentic/runs/tau-bench-retail-agent-005
INFO 07-24 11:05:01 [__init__.py:244] Automatically detected platform cuda.
WARNING 07-24 11:05:07 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 07-24 11:05:08 [__init__.py:244] Automatically detected platform cuda.
/root/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots/9c925d64d72725edaf899c6cb9c377fd0709d9c5
INFO 07-24 11:05:18 [config.py:823] This model supports multiple tasks: {'generate', 'reward', 'classify', 'score', 'embed'}. Defaulting to 'generate'.
INFO 07-24 11:05:18 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 07-24 11:05:19 [utils.py:2597] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
WARNING 07-24 11:05:20 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 07-24 11:05:22 [__init__.py:244] Automatically detected platform cuda.
INFO 07-24 11:05:24 [core.py:455] Waiting for init message from front-end.
INFO 07-24 11:05:24 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='Qwen/Qwen3-8B', speculative_config=None, tokenizer='Qwen/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
INFO 07-24 11:05:24 [worker_base.py:590] Injected <class 'art.vllm.engine.WorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['run', 'time']
WARNING 07-24 11:05:24 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f4d72d1a710>
INFO 07-24 11:05:25 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 07-24 11:05:25 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 07-24 11:05:25 [gpu_model_runner.py:1595] Starting to load model Qwen/Qwen3-8B...
INFO 07-24 11:05:25 [gpu_model_runner.py:1600] Loading model from scratch...
INFO 07-24 11:05:25 [cuda.py:252] Using Flash Attention backend on V1 engine.
INFO 07-24 11:05:26 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:03,  1.18it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.18it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.20it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.34it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.78it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.49it/s]

INFO 07-24 11:05:30 [default_loader.py:272] Loading weights took 3.38 seconds
INFO 07-24 11:05:30 [gpu_model_runner.py:1624] Model loading took 15.2683 GiB and 4.408849 seconds
INFO 07-24 11:05:36 [backends.py:462] Using cache directory: /root/.cache/vllm/torch_compile_cache/7a29998acf/rank_0_0 for vLLM's torch.compile
INFO 07-24 11:05:36 [backends.py:472] Dynamo bytecode transform time: 6.37 s
INFO 07-24 11:05:42 [backends.py:135] Directly load the compiled graph(s) for shape None from the cache, took 4.985 s
INFO 07-24 11:05:42 [monitor.py:34] torch.compile takes 6.37 s in total
INFO 07-24 11:05:43 [gpu_worker.py:227] Available KV cache memory: 97.78 GiB
INFO 07-24 11:05:43 [kv_cache_utils.py:715] GPU KV cache size: 712,032 tokens
INFO 07-24 11:05:43 [kv_cache_utils.py:719] Maximum concurrency for 40,960 tokens per request: 17.38x
INFO 07-24 11:05:58 [gpu_model_runner.py:2048] Graph capturing finished in 15 secs, took 0.75 GiB
INFO 07-24 11:05:58 [core.py:171] init engine (profile, create kv cache, warmup model) took 28.58 seconds
INFO 07-24 11:05:59 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 44502
Loading training tasks...
Training on 30 tasks
Validation on 10 tasks
Iterating dataset:   0%
 0/400 [00:00<?, ?batch/s]

--- Training Step 0 (Epoch 0, Step 0) ---

--- Evaluating at Step 0 ---
Evaluating model on 10 tasks...
gather: 100%
 10/10 [00:44<00:00,  6.11s/it, reward=0.1, total_steps=13.8, final_prompt_tokens=7034.0, avg_completion_tokens=74.9, max_completion_tokens=314, outcome_correct=0.2, forced_stop=0.1, duration=21.3, completion_tokens=74.9]
Eval task 30: reward=0.0
Eval task 31: reward=0.0
Eval task 32: reward=0.0
Eval task 33: reward=-1
Eval task 34: reward=0.0
Eval task 35: reward=0.0
Eval task 36: reward=1.0
Eval task 37: reward=1.0
Eval task 38: reward=0.0
Eval task 39: reward=0.0
Average evaluation reward: 0.1
Generating trajectories for 4 tasks...
gather: 100%
 16/16 [01:09<00:00,  6.31s/it, reward=-0.125, total_steps=17.8, final_prompt_tokens=7.89e+3, avg_completion_tokens=92.1, max_completion_tokens=390, outcome_correct=0.0625, forced_stop=0.188, duration=33.1, completion_tokens=92.1]
wandb: WARNING Tried to log to step 0 that is less than the current step 1. Steps must be monotonically increasing, so this data will be ignored. See https://wandb.me/define-metric to log data out of order.
wandb: WARNING Tried to log to step 0 that is less than the current step 1. Steps must be monotonically increasing, so this data will be ignored. See https://wandb.me/define-metric to log data out of order.
Training on 4 trajectory groups...
Packed 12 trajectories into 8 sequences of length 16384
train:   0%
 0/8 [00:00<?, ?it/s]
ERROR 07-24 11:08:21 [core.py:583] Invocation of collective_rpc method failed
ERROR 07-24 11:08:21 [core.py:583] Traceback (most recent call last):
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/dev/tau-bench/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 580, in _handle_client_request
ERROR 07-24 11:08:21 [core.py:583]     output.result = method(
ERROR 07-24 11:08:21 [core.py:583]                     ^^^^^^^
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/dev/tau-bench/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 347, in collective_rpc
ERROR 07-24 11:08:21 [core.py:583]     return self.model_executor.collective_rpc(method, timeout, args,
ERROR 07-24 11:08:21 [core.py:583]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/dev/tau-bench/.venv/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-24 11:08:21 [core.py:583]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-24 11:08:21 [core.py:583]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/dev/tau-bench/.venv/lib/python3.11/site-packages/vllm/utils.py", line 2671, in run_method
ERROR 07-24 11:08:21 [core.py:583]     return func(*args, **kwargs)
ERROR 07-24 11:08:21 [core.py:583]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/src/art/vllm/engine.py", line 131, in run
ERROR 07-24 11:08:21 [core.py:583]     return func(*args, **kwargs)
ERROR 07-24 11:08:21 [core.py:583]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/src/art/torchtune/service.py", line 291, in sleep
ERROR 07-24 11:08:21 [core.py:583]     worker.sleep(level)
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/dev/tau-bench/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 93, in sleep
ERROR 07-24 11:08:21 [core.py:583]     allocator.sleep(offload_tags=("weights", ) if level == 1 else tuple())
ERROR 07-24 11:08:21 [core.py:583]   File "/workspace/Multi-Turn-RL-Agent/tau_bench/ART/src/art/vllm/patches.py", line 54, in sleep
ERROR 07-24 11:08:21 [core.py:583]     cpu_backup_tensor = torch.empty(
ERROR 07-24 11:08:21 [core.py:583]                         ^^^^^^^^^^^^
ERROR 07-24 11:08:21 [core.py:583] RuntimeError: CUDA error: invalid argument
ERROR 07-24 11:08:21 [core.py:583] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 07-24 11:08:21 [core.py:583] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 07-24 11:08:21 [core.py:583] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 07-24 11:08:21 [core.py:583]
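The failure surfaces in ART's patched allocator sleep (src/art/vllm/patches.py) when torch.empty allocates the CPU backup tensor, and as the log notes, CUDA errors may be reported asynchronously, so the stack trace may not point at the real failing call. A minimal debugging sketch following the suggestion in the error message itself (this only affects error reporting; it is not a fix):

import os

# Force synchronous kernel launches so the CUDA error is raised at the
# call that actually failed. Must be set before CUDA is initialized,
# e.g. at the very top of the notebook or in the shell environment.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"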
