[BUG] Runtime error when trying to load Qwen3 32B #784

@umar-mq

Description

OS

Windows

GPU Library

CUDA 12.x

Python version

3.12

PyTorch version

2.6.0+cu124

Model

https://huggingface.co/CAPsMANyo/Qwen3-32B_exl2/tree/4.25

Describe the bug

When trying to load the new Qwen3 32B model, the module load completes and then the process immediately crashes with the error shown in the logs below.

Reproduction steps

Just run start.sh with this model configured; it crashes during load.
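
For reference, a minimal standalone repro outside TabbyAPI can help isolate the failure. The sketch below is illustrative and not part of the original report: it uses exllamav2's documented load API with the model path taken from the logs, and takes the simpler autosplit path rather than the tensor-parallel path TabbyAPI chose.

```python
# Hypothetical repro sketch, not from the original report.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

model_dir = "/home/thomas/text-generation-webui/models/Qwen3-32B-exl"

config = ExLlamaV2Config(model_dir)   # parses config.json in the model dir
model = ExLlamaV2(config)

# TabbyAPI loaded with tensor parallel; autosplit is the simpler loader and
# should exercise the same forward() code if the architecture is at fault.
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache, progress=True)

print("model loaded")
```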

Expected behavior

The model was expected to load successfully.

Logs

Activating venv
pip 24.3.1 from /home/thomas/tabbyAPI_NEW/venv/lib/python3.12/site-packages/pip (python 3.12)
Loaded your saved preferences from `start_options.json`
Starting TabbyAPI...
2025-04-30 13:18:44.754 INFO:     ExllamaV2 version: 0.2.8
2025-04-30 13:18:44.786 WARNING:  Disabling authentication makes your instance vulnerable. Set the `disable_auth` flag to False in config.yml if you want to share this instance with others.
2025-04-30 13:18:44.787 INFO:     Generation logging is enabled for: prompts
2025-04-30 13:18:44.817 WARNING:  Draft model is disabled because a model name wasn't provided. Please check your config.yml!
2025-04-30 13:18:44.818 WARNING:  An unsupported GPU is found in this configuration. Switching to compatibility mode.
2025-04-30 13:18:44.818 WARNING:  This disables parallel batching and features that rely on it (ex. CFG).
2025-04-30 13:18:44.818 WARNING:  To disable compatability mode, all GPUs must be ampere (30 series) or newer. AMD GPUs are not supported.
2025-04-30 13:18:44.819 INFO:     Attempting to load a prompt template if present.
2025-04-30 13:18:44.820 WARNING:  TemplateLoadError: Model JSON path "/home/thomas/text-generation-webui/models/Qwen3-32B-exl/chat_template.json" not found.
2025-04-30 13:18:44.842 INFO:     Using template "from_tokenizer_config" for chat completions.
2025-04-30 13:18:45.388 INFO:     Loading model: /home/thomas/text-generation-webui/models/Qwen3-32B-exl
2025-04-30 13:18:45.388 INFO:     Loading with tensor parallel
Loading model modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 131/131 0:00:00
Traceback (most recent call last):
  File "/home/thomas/tabbyAPI_NEW/start.py", line 291, in <module>
    entrypoint(args, parser)
  File "/home/thomas/tabbyAPI_NEW/main.py", line 166, in entrypoint
    asyncio.run(entrypoint_async())
  File "/home/thomas/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/thomas/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/thomas/tabbyAPI_NEW/main.py", line 71, in entrypoint_async
    await model.load_model(
  File "/home/thomas/tabbyAPI_NEW/common/model.py", line 112, in load_model
    async for _ in load_model_gen(model_path, **kwargs):
  File "/home/thomas/tabbyAPI_NEW/common/model.py", line 90, in load_model_gen
    async for module, modules in load_status:
  File "/home/thomas/tabbyAPI_NEW/backends/exllamav2/model.py", line 570, in load_gen
    async for value in iterate_in_threadpool(model_load_generator):
  File "/home/thomas/tabbyAPI_NEW/common/concurrency.py", line 30, in iterate_in_threadpool
    yield await asyncio.to_thread(gen_next, generator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/thomas/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/thomas/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/thomas/tabbyAPI_NEW/common/concurrency.py", line 20, in gen_next
    return next(generator)
           ^^^^^^^^^^^^^^^
  File "/home/thomas/tabbyAPI_NEW/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 57, in generator_context
    response = gen.send(request)
               ^^^^^^^^^^^^^^^^^
  File "/home/thomas/tabbyAPI_NEW/backends/exllamav2/model.py", line 731, in load_model_sync
    self.model.forward(input_ids, cache=self.cache, preprocess_only=True)
  File "/home/thomas/tabbyAPI_NEW/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/thomas/tabbyAPI_NEW/venv/lib/python3.12/site-packages/exllamav2/model.py", line 900, in forward
    r = self.forward_chunk(
        ^^^^^^^^^^^^^^^^^^^
  File "/home/thomas/tabbyAPI_NEW/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^
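
The traceback above is cut off before the final exception message. Since the logs show ExllamaV2 0.2.8, which appears to predate the Qwen3 release, one quick check is whether the installed build recognizes the architecture the model declares. The snippet below is a hedged diagnostic sketch, not from the original report; only the model path is taken from the logs.

```python
# Hypothetical diagnostic, not part of the original report.
# Prints the installed exllamav2 version and the model's declared
# architecture; a build that predates the architecture may fall back to a
# default config and crash during the first forward() pass, as seen above.
import json
import importlib.metadata

print("exllamav2:", importlib.metadata.version("exllamav2"))

model_dir = "/home/thomas/text-generation-webui/models/Qwen3-32B-exl"
with open(f"{model_dir}/config.json") as f:
    print("architectures:", json.load(f).get("architectures"))
```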

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand the developers of this program are human, and I will ask my questions politely.
  • I understand that the developers have lives and my issue will be answered when possible.
