Replies: 1 comment 1 reply
I would expect the load time of the RPC servers to be mainly limited by the network bandwidth, so unless you have multiple NICs with direct connections to each server, I don't think this is likely to help significantly. The best way to reduce the load time would be to implement a tensor cache in the server (as previously mentioned in #9740 (comment)).
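For illustration, here is a minimal sketch of the tensor-cache idea: the client sends a hash of the tensor payload first, and if the server already has a file with that hash in a local cache directory it reads the bytes from disk instead of receiving them over the network. This is not the actual rpc-server code; `tensor_hash` and `load_from_cache` are hypothetical helper names.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// hypothetical helper: 64-bit FNV-1a over the tensor bytes
static uint64_t tensor_hash(const void * data, size_t size) {
    const uint8_t * p = (const uint8_t *) data;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < size; ++i) {
        h = (h ^ p[i]) * 0x100000001b3ULL;
    }
    return h;
}

// server side: try the local cache before asking the client to stream the data
static bool load_from_cache(const std::string & cache_dir, uint64_t hash, std::vector<uint8_t> & out) {
    std::ifstream f(cache_dir + "/" + std::to_string(hash), std::ios::binary);
    if (!f) {
        return false;  // miss: the client has to send the tensor over the network
    }
    out.assign(std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>());
    return true;       // hit: no network transfer needed for this tensor
}
```

On a second and later run of the same model, most tensors would be cache hits, so the load time would no longer be bounded by the network link.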
Hi everyone,
I'm working on asynchronously launching ggml_backend_tensor_set in llama-model-loader.cpp using a thread pool. Currently these calls are executed sequentially. Previously I tried creating a separate queue for each device and processing the data sequentially per device, but that didn't yield the expected results.
Right now I'm temporarily using a single shared queue, but unfortunately that doesn't solve the problem either: I'm hitting a SIGSEGV. I will keep looking for a solution.
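For reference, here is a rough sketch of the per-backend upload worker I have in mind. It is illustrative only, assuming one worker thread per backend so that each backend only ever sees ggml_backend_tensor_set() calls from a single thread; the names `upload_job` and `device_worker` are made up, not llama.cpp code.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

#include "ggml-backend.h"

struct upload_job {
    ggml_tensor * tensor;   // destination tensor, already allocated on the backend
    const void  * data;     // source bytes, must stay valid until the job runs
    size_t        offset;
    size_t        size;
};

struct device_worker {
    std::queue<upload_job>  jobs;
    std::mutex              mtx;
    std::condition_variable cv;
    bool                    done = false;
    std::thread             thr;

    device_worker() {
        thr = std::thread([this] {
            for (;;) {
                upload_job job;
                {
                    std::unique_lock<std::mutex> lock(mtx);
                    cv.wait(lock, [this] { return done || !jobs.empty(); });
                    if (jobs.empty()) {
                        return;  // done and fully drained
                    }
                    job = jobs.front();
                    jobs.pop();
                }
                // the actual copy to the backend (RPC, CUDA, ...)
                ggml_backend_tensor_set(job.tensor, job.data, job.offset, job.size);
            }
        });
    }

    void push(const upload_job & job) {
        { std::lock_guard<std::mutex> lock(mtx); jobs.push(job); }
        cv.notify_one();
    }

    void finish() {
        { std::lock_guard<std::mutex> lock(mtx); done = true; }
        cv.notify_one();
        thr.join();
    }
};
```

The idea would be that the loader calls push() instead of ggml_backend_tensor_set() directly and calls finish() on every worker before freeing any staging buffers. One likely cause of a SIGSEGV in any async variant is exactly that lifetime issue: if the loader reuses or frees the read buffer for the next tensor before the queued job has actually copied it, the worker dereferences a stale pointer.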
I wanted to ask whether anyone is currently working on multithreaded offloading. I would appreciate any advice or ideas!
Why multithreading? Because we load the model onto rpc-server devices spread across 4 PCs, and loading takes a very long time.
Thank you!