I am studying the source code of llama.cpp. I came across the thread-pool code and want to understand how multithreading improves performance during computation. When running inference on the CPU, I tried different `-t` values to vary the number of threads. I noticed that more threads clearly speed up prompt processing, but generation speed does not improve. I understand that the difference between the prompt and generation phases lies in the sequence_len dimension: the prompt phase processes many tokens at once, while generation processes one token at a time. Is the reason multithreading accelerates the prompt phase that it allows multiple tokens to be processed in parallel along the sequence_len dimension? (I've put a small sketch of my mental model after the questions below.) Additionally, I have two other questions:
1. In a pure CPU inference build, can multithreading work without the OPENMP compilation flag? (I sketched below what I imagine a thread pool without OpenMP could look like.)
2. Once the compute graph for an inference pass has been built, how is it distributed across the threads for parallel processing? (My guess is sketched below as well.)
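
To make my first question concrete, here is my mental model as a minimal sketch. This is my own simplified code with made-up names, not llama.cpp's actual implementation: the flattened output of a matmul is split row-wise across threads, and during the prompt phase there are n_tokens times more output rows to share out than during generation.

```cpp
// My assumption of how work splits across threads (hypothetical code,
// not ggml's): out[n_tokens][n_out] = x[n_tokens][n_in] * W^T, with the
// flattened output rows divided evenly among nth threads.
#include <cstdio>
#include <thread>
#include <vector>

static void matmul_slice(const float *W, const float *x, float *out,
                         int n_in, int n_out, int n_tokens, int ith, int nth) {
    const int rows = n_tokens * n_out;       // total work items
    const int r0 = rows *  ith      / nth;   // this thread's slice
    const int r1 = rows * (ith + 1) / nth;
    for (int r = r0; r < r1; ++r) {
        const int t = r / n_out;             // token index
        const int o = r % n_out;             // output feature index
        float sum = 0.0f;
        for (int i = 0; i < n_in; ++i) {
            sum += W[o * n_in + i] * x[t * n_in + i];
        }
        out[r] = sum;
    }
}

int main() {
    const int n_in = 1024, n_out = 1024, nth = 8;
    const int n_tokens = 64;   // prompt phase: many tokens -> many rows to split;
                               // generation would be n_tokens = 1
    std::vector<float> W(n_out * n_in, 0.01f);
    std::vector<float> x(n_tokens * n_in, 1.0f);
    std::vector<float> out(n_tokens * n_out);

    std::vector<std::thread> workers;
    for (int ith = 0; ith < nth; ++ith) {
        workers.emplace_back(matmul_slice, W.data(), x.data(), out.data(),
                             n_in, n_out, n_tokens, ith, nth);
    }
    for (auto &w : workers) w.join();
    std::printf("out[0] = %f\n", out[0]);
}
```

If this picture is right, then with n_tokens = 1 each weight value is read from memory for only a single dot product, so I suspect generation becomes memory-bandwidth bound and extra threads stop helping, which would match what I measured.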
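
For question 1, here is a minimal sketch of what I imagine a thread pool without OpenMP could look like, using only std::thread (again my own hypothetical code, not ggml's actual thread pool):

```cpp
#include <functional>
#include <thread>
#include <vector>

// Run fn(ith, nth) on nth threads and wait for all of them; the calling
// thread participates as thread 0 instead of sitting idle.
static void parallel_run(int nth, const std::function<void(int, int)> &fn) {
    std::vector<std::thread> workers;
    for (int ith = 1; ith < nth; ++ith) {
        workers.emplace_back(fn, ith, nth);
    }
    fn(0, nth);
    for (auto &w : workers) w.join();
}
```

My understanding is that a real implementation would keep the workers alive between operations and synchronize them with atomics or a barrier rather than spawning and joining threads for every op, but this seems to show that OpenMP is not strictly required.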
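
And for question 2, my current guess as another hypothetical sketch: every thread walks the same list of graph nodes in order, each thread computes its own slice of each node, and a barrier between nodes keeps dependencies intact (this uses C++20 std::barrier):

```cpp
#include <barrier>
#include <thread>
#include <vector>

struct Node {
    int n_rows;                            // total rows of work in this op
    void compute_rows(int r0, int r1) {    // placeholder per-op kernel
        (void)r0; (void)r1;
    }
};

// Hypothetical scheduler: all nth threads traverse the node list together.
static void run_graph(std::vector<Node> &nodes, int nth) {
    std::barrier sync(nth);
    auto worker = [&](int ith) {
        for (auto &node : nodes) {
            const int r0 = node.n_rows *  ith      / nth;
            const int r1 = node.n_rows * (ith + 1) / nth;
            node.compute_rows(r0, r1);     // this thread's slice of the node
            sync.arrive_and_wait();        // all threads finish this node
        }                                  // before anyone starts the next
    };
    std::vector<std::thread> ts;
    for (int ith = 1; ith < nth; ++ith) ts.emplace_back(worker, ith);
    worker(0);                             // main thread is worker 0
    for (auto &t : ts) t.join();
}
```

Is this roughly how it works, or does the scheduler assign whole nodes to different threads instead?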