Replies: 1 comment
- `--no-mmap` improved inference by 1 t/s, but it's still slower than it used to be.
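For reference, this is roughly the comparison I ran (model path, prompt, and thread count are just placeholders for my setup):

```sh
# Default run uses mmap: the model file is paged in from disk on demand.
./main -m models/llama-2-70b.Q4_K_M.gguf -p "Hello" -n 128 -t 16

# --no-mmap loads the whole model into RAM up front instead.
./main -m models/llama-2-70b.Q4_K_M.gguf -p "Hello" -n 128 -t 16 --no-mmap
```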
-
I used to get around 6.5-7.5 tokens per second on a llama2-70b model; now I get 3.5 t/s (CPU only).
I've been out of the AI game for a little over a month, so my system software has updated a lot.
The main thing that has changed on my end is that I'm now using GGUF instead of GGML, so maybe that's where the slowdown actually comes from, but I feel like that isn't it. Maybe it's the memory mapping? I've heard that can be slower.
I can try it with --no-mmap to see if that makes a difference, but I was wondering if anyone has an idea.
Also, as a side note: will Intel MKL use both my CPU and GPU? I was actually wondering if Intel oneAPI might be better for me, and then I saw today that Intel MKL is apparently part of oneAPI. Right now I build with OpenBLAS support and get BLAS=1 when running, but running with and without BLAS doesn't seem to make a difference.
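For context, this is roughly how I'm building. The MKL line is just my guess at the equivalent, based on CMake's FindBLAS vendor names, so treat it as unverified:

```sh
# What I currently do: OpenBLAS build via the Makefile.
make clean && make LLAMA_OPENBLAS=1

# What I'd presumably try for MKL via CMake (untested on my end;
# Intel10_64lp is CMake's FindBLAS vendor name for MKL):
mkdir -p build && cd build
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp
cmake --build . --config Release
```

From what I understand, BLAS mainly kicks in for prompt processing (the big batched matrix multiplications), not single-token generation, which might be why I don't see a difference in t/s with and without it.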