Which quantized gguf size infers at INT8? #5662

segmond · 2024-02-22T14:47:49Z

I see references to GGUF_TYPE_INT8 in ggml.c. Is any particular GGUF quantization guaranteed to use that type, or does the file have to be generated a certain way?

sorasoras · 2024-02-22T18:13:01Z

I think that's an ARM CPU thing.

1 reply

segmond Feb 22, 2024
Author

Got it. Would it be possible to run inference in int8, and how could we support that? The Tesla P40 yields about 47 teraflops at int8, while the RTX 3090 has about 35+ teraflops at f16/f32. In theory, we should see better performance from the P40 than from the 3090 if the data is in the right format.
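
To make "data in the right format" concrete, here is a minimal sketch of the operation that int8 throughput figures generally count: 8-bit multiplies accumulated into a 32-bit integer, grouped four at a time. It is a plain scalar emulation for illustration, not llama.cpp or ggml code, and the names `dot4_i8` and `dot_i8` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 4-way int8 dot product added to an int32 accumulator.
   Hardware int8 paths typically do this in a single instruction. */
static int32_t dot4_i8(const int8_t *a, const int8_t *b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Dot product of two int8 vectors of length n, accumulated in int32. */
int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = dot4_i8(a + i, b + i, acc);   /* full groups of four */
    }
    for (; i < n; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i]; /* leftover elements */
    }
    return acc;
}
```

Both operands have to be int8 for this path to apply, so float activations would need to be quantized first. Whether that would actually make the P40 faster than the 3090 is not something this sketch can show.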

BarfingLemurs · 2024-02-22T18:26:28Z

https://github.com/ggerganov/llama.cpp/pull/4966/files

It should work on certain ARM platforms with Q4_0 and Q8_0 (at least) as shown in the discussions.
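
For readers unfamiliar with the Q8_0 format mentioned above, the following is a rough sketch of block-wise 8-bit quantization in that style: 32 values share one scale, and a block-by-block dot product multiplies int8 values in integer arithmetic before applying the two scales. It assumes a `float` scale for simplicity (my understanding is that ggml stores the scale as fp16), and the names `block_q8_sketch`, `quantize_block`, and `dot_block` are hypothetical, not the ggml API.

```c
#include <stdint.h>

#define QK 32 /* values per block (assumed to match Q8_0's block size) */

typedef struct {
    float  d;      /* per-block scale; a float here for simplicity */
    int8_t qs[QK]; /* quantized values in [-127, 127] */
} block_q8_sketch;

/* Quantize QK floats into one block: d = max|x| / 127, q = round(x / d). */
void quantize_block(const float *x, block_q8_sketch *out) {
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        float v = x[i] < 0.0f ? -x[i] : x[i];
        if (v > amax) amax = v;
    }
    out->d = amax / 127.0f;
    float id = out->d != 0.0f ? 1.0f / out->d : 0.0f;
    for (int i = 0; i < QK; ++i) {
        float s = x[i] * id;
        /* round half away from zero without needing libm */
        out->qs[i] = (int8_t)(s >= 0.0f ? s + 0.5f : s - 0.5f);
    }
}

/* Dot product of two blocks: integer sum of products, then both scales. */
float dot_block(const block_q8_sketch *a, const block_q8_sketch *b) {
    int32_t sum = 0;
    for (int i = 0; i < QK; ++i) {
        sum += (int32_t)a->qs[i] * (int32_t)b->qs[i];
    }
    return a->d * b->d * (float)sum;
}
```

The inner loop of `dot_block` is the part that an int8 kernel would replace with hardware integer dot-product instructions; the scales are applied once per block in floating point.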

0 replies
