Replies: 4 comments 10 replies
-
The README.md#metal-build documentation indicates that on macOS, when built with Metal support, you can explicitly disable GPU inference with the `-ngl 0` (`--n-gpu-layers 0`) command-line argument.
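For reference, the same thing can be requested programmatically. A minimal sketch, assuming a llama.cpp C API revision where `n_gpu_layers` lives on `llama_model_params` (older revisions put it on `llama_context_params`); the model path is a placeholder:

```cpp
#include "llama.h"

int main() {
    // Ask for zero offloaded layers: even in a Metal build, inference stays on the CPU.
    // This mirrors passing -ngl 0 (--n-gpu-layers 0) on the command line.
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 0;

    llama_model * model = llama_load_model_from_file("models/llama-2-7b.Q8_0.gguf", mparams);
    if (model == NULL) return 1;

    // ... create a context and generate as usual; all layers run on the CPU ...

    llama_free_model(model);
    return 0;
}
```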
-
In the case of speculative sampling, would it be possible to offload the larger model to the GPU while the smaller model(s) utilise the CPU?
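For concreteness, a hedged sketch of what that split could look like at the API level, under the same `llama_model_params` assumption as above; the model paths and offload values are placeholders, and whether the speculative-decoding example exposes separate offload settings per model depends on the llama.cpp version:

```cpp
#include "llama.h"

int main() {
    // Target (large) model: offload every layer to Metal.
    llama_model_params tgt_params = llama_model_default_params();
    tgt_params.n_gpu_layers = 999; // any value >= the layer count means full offload

    // Draft (small) model: keep all layers on the CPU.
    llama_model_params dft_params = llama_model_default_params();
    dft_params.n_gpu_layers = 0;

    llama_model * target = llama_load_model_from_file("models/llama-2-70b.Q4_K_M.gguf", tgt_params);
    llama_model * draft  = llama_load_model_from_file("models/llama-2-7b.Q4_K_M.gguf",  dft_params);

    // ... speculative loop: the draft proposes a few tokens on the CPU,
    //     the target verifies them in one batched pass on the GPU ...

    llama_free_model(draft);
    llama_free_model(target);
    return 0;
}
```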
-
This is not supported on Mac, and it is very unlikely to bring any benefit even if it were supported. The best thing you can try is to run a large LLM on the GPU. The main reason is that the memory bandwidth of the chip is shared between the CPU and GPU (AFAIK), so if you have already saturated it with the GPU, the CPU won't help.
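To make the bandwidth argument concrete, a back-of-envelope sketch; the numbers are illustrative assumptions (roughly a base M2 and a 7B model at 8-bit), not measurements:

```cpp
#include <cstdio>

int main() {
    const double bandwidth_gb_s = 100.0; // assumed unified memory bandwidth (base M2 class)
    const double weights_gb     = 7.0;   // ~7B parameters at 8-bit quantization

    // Each generated token streams (roughly) the full set of weights from memory once,
    // so the ceiling is the same no matter which processor does the reading.
    const double max_tok_per_s = bandwidth_gb_s / weights_gb; // ~14 tokens/s

    std::printf("memory-bandwidth ceiling: ~%.0f tokens/s\n", max_tok_per_s);
    return 0;
}
```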
-
I'm running 8-bit quantized Llama 2 with the GPU at 99% utilization, while 12 performance cores and the Neural Engine sit idle. Could the existing code be used to divide the work between the CPU and GPU concurrently? Could the links below allow the Neural Engine to be used as well?
https://developer.apple.com/library/archive/documentation/Performance/Conceptual/vDSP_Programming_Guide/Introduction/Introduction.html
https://developer.apple.com/documentation/accelerate/veclib
I'd like to squeeze every fixed-point operation per second I can out of my M2. It seems we can run on either the CPU or the GPU, but there is no code path that uses both at once, let alone all three with the Neural Engine. How challenging would this be to do?
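For what it's worth, the closest existing path is partial offload: requesting fewer GPU layers than the model has leaves the rest on the CPU. A minimal sketch under the same API assumptions as above (the layer split and model path are placeholders); note that this divides the layers between the backends rather than running both processors concurrently on the same token, and, as far as I know, the Accelerate/vDSP routines linked above execute on the CPU rather than the Neural Engine.

```cpp
#include "llama.h"

int main() {
    // Offload only part of the model: 20 layers go to Metal, the remaining layers stay on the CPU.
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 20; // placeholder split for a ~40-layer 13B model

    llama_model * model = llama_load_model_from_file("models/llama-2-13b.Q8_0.gguf", mparams);
    if (model == NULL) return 1;

    // ... generation proceeds as usual; per token, the CPU-resident layers run
    //     on the CPU threads and the offloaded layers run on the GPU ...

    llama_free_model(model);
    return 0;
}
```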