Can ggllm.cpp run Falcon on Apple Silicon? #63
-
As I understand it, ggllm.cpp is a fork of llama.cpp intended to run Falcon models. llama.cpp can run LLaMA-derived models on Apple Silicon (M1/M2 Macs), and can even run them on the integrated GPU via Metal rather than only on the CPU cores. Can ggllm.cpp run Falcon models on Apple Silicon? Obviously Falcon 40B needs a lot of VRAM to run on a discrete GPU, so Apple's unified-memory approach of sharing system RAM rather than dedicated VRAM is appealing: a fair number of modern Macs have 64 GB or more of RAM. If this doesn't currently work, is it something that might be added, and if so on what timeframe? Or if it does work, could you add Apple Silicon build/setup instructions?
-
Falcon is quite VRAM-friendly compared to LLaMA: it uses multi-query attention (MQA), which needs only a fraction of the KV-cache memory, and it appears to degrade less under heavy quantization than typical LLaMA models.
You can run a high-quality quantized Falcon 40B in less than 24 GB of RAM. 64 GB is of course plenty; around 36 GB is needed to run it at very high quality, which is beyond most single GPUs.
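To make the MQA point concrete, here is a rough back-of-the-envelope KV-cache estimate. The shape numbers (60 layers, head dimension 64, 8 KV heads for Falcon-40B, a 2048-token context, fp16 cache) are assumptions taken from the commonly published Falcon config, not from this repo:

$$
\text{KV cache bytes} \approx 2 \cdot n_\text{layers} \cdot n_\text{ctx} \cdot n_\text{kv heads} \cdot d_\text{head} \cdot \text{bytes per element}
$$

Plugging in the assumed numbers: 2 · 60 · 2048 · 8 · 64 · 2 ≈ 0.25 GB. A standard multi-head cache, with all 128 query heads keeping their own K/V, would be roughly 16× larger (≈ 4 GB), which is where most of the memory saving at long contexts comes from.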
Regarding running on a Mac: I've been told it works. I don't have one here for testing, but others report it runs fast with Metal.
You should be able to build it much like llama.cpp; you might need to disable cuBLAS manually in addition to enabling Metal.
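For reference, here is a minimal build sketch for an Apple Silicon Mac. It assumes the fork keeps upstream llama.cpp's CMake options of that era (LLAMA_METAL, LLAMA_CUBLAS) as well as a falcon_main binary with llama.cpp-style flags; check this repo's CMakeLists.txt and README for the exact names before relying on it.

```sh
# Assumed to mirror upstream llama.cpp's build options; verify the flag
# names against this repo's CMakeLists.txt, as the fork may differ.
git clone https://github.com/cmp-nct/ggllm.cpp
cd ggllm.cpp
mkdir build && cd build

# Enable the Metal backend and explicitly turn cuBLAS off
# (there is no NVIDIA GPU on Apple Silicon).
cmake .. -DLLAMA_METAL=ON -DLLAMA_CUBLAS=OFF
cmake --build . --config Release

# Hypothetical run example: the binary name, model path, and flags are
# assumptions borrowed from llama.cpp-style usage (-m model file, -ngl to
# offload layers to the GPU); point -m at your actual quantized Falcon file.
./bin/falcon_main -m /path/to/falcon-40b-quantized.bin -p "Hello" -ngl 100
```

The key part is pairing -DLLAMA_METAL=ON with -DLLAMA_CUBLAS=OFF, as the reply above notes: the build may otherwise try to pick up CUDA, which is not available on Apple Silicon. If in doubt, configure in a clean build directory.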