Replies: 1 comment
- `--no-mmap` improved inference by 1 t/s, but it's still slower than it used to be.
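For reference, this is roughly the comparison I ran (model path, prompt, and thread count are just placeholders for my setup):

```sh
# Default run uses mmap: the model file is paged in from disk on demand.
./main -m models/llama-2-70b.Q4_K_M.gguf -p "Hello" -n 128 -t 16

# --no-mmap loads the whole model into RAM up front instead.
./main -m models/llama-2-70b.Q4_K_M.gguf -p "Hello" -n 128 -t 16 --no-mmap
```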
-
I used to get around 6.5-7.5 tokens per second on a llama2-70b model; now I get 3.5 t/s (CPU only).
I've been out of the AI game for a little over a month, so my system software has updated a lot.
The main thing that has changed on my end is that I'm now using GGUF instead of GGML, so maybe that's where the slowdown actually comes from, but I feel like that isn't it. Maybe it's the memory mapping? I've heard that can be slower.
I can try it with --no-mmap to see if that makes a difference, but I was wondering if anyone has an idea.
Also, as a side note: will Intel MKL use both my CPU and GPU? I was actually wondering if Intel oneAPI might be better for me, and then I saw today that Intel MKL is apparently part of oneAPI. Right now I build with OpenBLAS support and get BLAS=1 when running, but running with and without BLAS doesn't seem to make a difference.
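For context, this is roughly how I'm building. The MKL line is just my guess at the equivalent, based on CMake's FindBLAS vendor names, so treat it as unverified:

```sh
# What I currently do: OpenBLAS build via the Makefile.
make clean && make LLAMA_OPENBLAS=1

# What I'd presumably try for MKL via CMake (untested on my end;
# Intel10_64lp is CMake's FindBLAS vendor name for MKL):
mkdir -p build && cd build
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp
cmake --build . --config Release
```

From what I understand, BLAS mainly kicks in for prompt processing (the big batched matrix multiplications), not single-token generation, which might be why I don't see a difference in t/s with and without it.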