Facing problem while using asynchronous streaming with ChatOllama #31632
Unanswered
sourav-eegrab
asked this question in
Q&A
Checked other resources
Commit to Help
Example Code
Description
I've successfully achieved asynchronous behavior in the application with num_gpu=48. In this configuration the system offloads part of the workload to the CPU, which enables async execution, though startup is slightly slower. However, when num_gpu is increased beyond 48, the application switches to synchronous execution but performs noticeably faster. This behavior was observed and validated through parallel runs across multiple systems. It is also worth noting that in both cases certain tasks are still offloaded to the CPU.

How can I achieve asynchronous behavior while the model is fully offloaded to the GPU? @baskaryan @hwchase17 any help is highly appreciated.
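For reference, the asynchronous behavior described above means two streams interleave instead of one blocking the other. This can be demonstrated with plain asyncio, no Ollama server required; the `fake_stream` generator below is a stand-in for `ChatOllama.astream(...)` (its name, chunk format, and delays are illustrative assumptions, not the real API):

```python
import asyncio

async def fake_stream(name, n, delay):
    # Stand-in for ChatOllama.astream(): yields token-like chunks,
    # awaiting between them so other tasks get a chance to run.
    for i in range(n):
        await asyncio.sleep(delay)
        yield f"{name}-{i}"

async def consume(name, n, delay, log):
    # Collect chunks into a shared log so we can inspect the ordering.
    async for chunk in fake_stream(name, n, delay):
        log.append(chunk)

async def main():
    log = []
    # When streaming is truly asynchronous, the two consumers interleave:
    # chunks from "a" and "b" arrive mixed together rather than one
    # stream finishing before the other starts.
    await asyncio.gather(
        consume("a", 3, 0.01, log),
        consume("b", 3, 0.01, log),
    )
    return log

log = asyncio.run(main())
print(log)
```

If the same pattern with real `ChatOllama` instances produces one fully-ordered stream after the other, the calls are effectively running synchronously, which matches the behavior reported above for num_gpu values beyond 48.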
System configuration:
System Info