Fast embedding, sampling, and freq_cis computation of decoding on CPU #8782

kaizizzzzzz · 2024-07-30T21:57:48Z


Hi, I compared the latency of the embedding lookup and the freq_cis computation (the steps that run before the transformer layers) on CPU and GPU in the "gpt-fast" repo. The CPU latency is still much higher than the GPU latency, even in the decoding stage, where only a single token_id is processed rather than several. I'm curious where the extra latency comes from, and whether llama.cpp has any optimizations to speed up these computations on CPU, since I would expect these non-compute-heavy steps to be fast on CPU as well.

Thanks!
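
To make the question concrete, here is a minimal pure-Python sketch of what I mean by these steps being "non-computation" during decoding. It assumes the rotary table is precomputed once up front, so each decode step only needs two row lookups. The function names, the `base` value, and the toy sizes are my own assumptions for illustration, not code taken from gpt-fast or llama.cpp.

```python
import cmath

def precompute_freqs_cis(max_seq_len, head_dim, base=10000.0):
    """Build the rotary table once: table[pos][i] = exp(1j * pos * base**(-2*i/head_dim))."""
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    return [[cmath.rect(1.0, pos * f) for f in inv_freq] for pos in range(max_seq_len)]

def decode_step_inputs(embedding, freqs_cis, token_id, pos):
    """Per-token work before the first transformer layer: two row lookups, no arithmetic."""
    return embedding[token_id], freqs_cis[pos]

# Toy sizes (hypothetical), chosen only so the example runs instantly.
vocab_size, dim, head_dim, max_seq_len = 8, 4, 4, 16
embedding = [[float(t * dim + d) for d in range(dim)] for t in range(vocab_size)]
freqs_cis = precompute_freqs_cis(max_seq_len, head_dim)

# One decode step: a single token_id at a single position.
x, rot = decode_step_inputs(embedding, freqs_cis, token_id=3, pos=5)
```

If the real per-step work is this small, I would guess the measured CPU latency comes mostly from per-call overhead rather than from the lookups themselves, but that is only my guess.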

Replies: 0 comments