Errors encountered during API calls while running DeepSeek R1:671b in multi-GPU mode (RTX 4090 * 2) #968
Closed
lililolo0927
started this conversation in
General
Replies: 1 comment 7 replies
-
Seems like a cudagraph error? Try updating the codebase and using DeepSeek-V3-Chat-multi-gpu-marlin.yaml instead; it should give better performance.
KTransformers uses the CPU to compute the expert layers, so if the VRAM on a single card is enough, adding GPUs will not increase speed.
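For context on why VRAM usage stays low: the expert layers are placed on the CPU by the optimize-rule YAML that KTransformers loads at startup. The fragment below is an illustrative sketch in the style of those rule files, not an exact copy of the shipped DeepSeek-V3-Chat*.yaml — class names, keys, and regexes may differ between versions, so check the rule files in your own checkout:

```yaml
# Illustrative sketch of a KTransformers optimize rule; exact keys/classes may differ by version.
- match:
    name: "^model\\.layers\\..*\\.mlp\\.experts$"   # the MoE expert modules (the bulk of the 671B weights)
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cpu"        # experts are evaluated on the CPU during decode
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"          # expert outputs are copied back to the GPU
  recursive: False
- match:
    name: "^model\\.layers\\..*\\.self_attn$"       # attention stays on the GPU(s)
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
```

The multi-GPU variants of these rule files differ mainly in splitting the GPU-resident layers across cuda:0 and cuda:1 by layer index; the experts stay on the CPU either way, which is why adding a second card does not speed up decoding and why VRAM consumption looks capped.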
-
Hello
I am using a container in which KTransformers 0.2.1 runs as an API server (following https://github.com/ubergarm/r1-ktransformers-guide).
I launched the server to run DeepSeek-R1:671b (Q4) on two GPUs (RTX 4090) with the following command:
Before adding the multi-GPU option, requests completed successfully with the following curl command:
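A minimal request of that form, assuming the server exposes the OpenAI-compatible /v1/chat/completions endpoint (the port 10002 and the model name are placeholders, not the exact values used here), would look like:

```bash
# Illustrative request only -- endpoint path, port, and model name are assumptions,
# not the exact command from the original report.
curl http://localhost:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 64
      }'
```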
After adding the multi-GPU option, nvidia-smi shows the processes distributed across both GPUs (as shown in the capture above), but the following error keeps occurring on the client side,
and the error below appears on the server side.
Why does this happen, and what should I do to use both GPUs?
In addition, my second question: as shown in the attached captures (the server running on a single GPU and on two GPUs, respectively), the GPUs are never fully utilized, yet generation is still very slow (I had assumed the slowdown was due to a lack of VRAM). Why is only a limited, almost fixed amount of VRAM consumed when running the model?
Thank you in advance.