Force All Computations to Run on GPU during Partial Offloading #11442
Replies: 2 comments
-
This will also significantly speed up token generation when …
-
Hmm, I think this is already possible when trying to offload everything onto a single GPU without enough VRAM. It is not intuitive right now, but it can be done: look at the llama.cpp terminal output to find the number of layers in the model, then run llama.cpp with the -ngl argument set to that layer count, with the unified memory environment variable set just before the llama.cpp command on Linux. It works, but on a mobile RTX 3060 it is slower than hybrid inference because the GPU has to wait for the CPU to swap memory from RAM into VRAM. Then again, that GPU is only connected over PCIe 3.0 x8, so that might be a big bottleneck.
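A minimal sketch of the invocation described above, assuming a Linux machine with the CUDA backend; the model path is a placeholder, and GGML_CUDA_ENABLE_UNIFIED_MEMORY is the unified-memory environment variable documented in llama.cpp's CUDA build notes:

```sh
# Placeholder model path. Setting -ngl to at least the layer count reported in
# the load logs offloads every layer; the env var lets CUDA spill to system RAM
# instead of failing when VRAM runs out (Linux only).
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./llama-cli -m ./models/model.gguf -ngl 99 -p "Hello"
```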
-
I propose adding a command-line argument to force all computations onto the GPU during partial offloading, with CPU RAM serving as an offload buffer. This would allow us to maintain high performance and efficiency even with models that are too large for the available VRAM.
Motivation
High Bandwidth PCIe 5.0: The upcoming Nvidia and AMD consumer graphics cards support PCIe 5.0 x16, offering up to 64 GB/s of bandwidth. The increased bandwidth allows for faster data transfer between the CPU and GPU, which can mitigate some of the overhead of transferring model layers across the PCIe bus during token generation. Additional PCIe slots further increase the aggregate bandwidth, and systems on older PCIe generations could benefit significantly as well.
Speculative Decoding: Speculative decoding benefits significantly from the additional compute provided by the GPU, because the larger target model evaluates the draft tokens in a batched fashion, which is compute bound and therefore ineffective on the CPU.
Mixture-of-Experts Models + Multi-Token Prediction: The growing popularity of mixture-of-experts models and multi-token prediction methods, thanks to the new releases from DeepSeek, suggests the potential for much higher throughput even with very large parameter models.
With all of these advancements combined, it is conceivable that we could get usable tokens/second with very high parameter models by using partial offloading between GPU VRAM and CPU RAM, if this enhancement is made.
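As a rough, illustrative upper bound (assumptions: a MoE model with about 37B active parameters at roughly 4-bit quantization, all active weights crossing the bus once per step, and no overlap or caching):

$$
32\,\text{GT/s} \times 16\,\text{lanes} \times \tfrac{128}{130} \div 8 \approx 63\,\text{GB/s},
\qquad
\frac{\approx 18.5\,\text{GB of active weights}}{63\,\text{GB/s}} \approx 0.3\,\text{s/token}
$$

That is only around 3 tokens/s if every token pays the full transfer, but speculative decoding and multi-token prediction let a single weight transfer serve several accepted tokens, which is where this combination becomes interesting.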
Implementation
This is partly inspired by the UMbreLLa project, which has shown that something very similar is feasible:
https://github.com/Infini-AI-Lab/UMbreLLa
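To make the idea concrete, here is a minimal, self-contained CUDA sketch of the general pattern: double-buffered streaming of per-layer weights from pinned host RAM, overlapping the upload of layer i+1 with the compute of layer i. This is not llama.cpp code and not UMbreLLa's implementation; the kernel, layer sizes, and layer count are placeholder assumptions, and only the synchronization pattern is the point.

```cuda
// Illustrative only: double-buffered weight streaming, not llama.cpp internals.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

#define CUDA_CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "CUDA error %s at %s:%d\n", cudaGetErrorString(e_), __FILE__, __LINE__); \
    return 1; } } while (0)

// Stand-in for a real transformer layer (matmuls, attention, ...).
__global__ void fake_layer_kernel(const float * weights, float * activations, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) activations[i] += 0.001f * weights[i];
}

int main() {
    const int    n_layers    = 32;        // assumption: model depth
    const size_t layer_elems = 1u << 20;  // assumption: weights per layer
    const size_t layer_bytes = layer_elems * sizeof(float);

    // Per-layer weights live in pinned host RAM so cudaMemcpyAsync can overlap with compute.
    std::vector<float *> h_layers(n_layers);
    for (int l = 0; l < n_layers; ++l) {
        CUDA_CHECK(cudaMallocHost((void **) &h_layers[l], layer_bytes));
        for (size_t i = 0; i < layer_elems; ++i) h_layers[l][i] = 0.5f;
    }

    // Two device buffers: one is computed on while the other receives the next layer.
    float * d_weights[2];
    float * d_act;
    CUDA_CHECK(cudaMalloc((void **) &d_weights[0], layer_bytes));
    CUDA_CHECK(cudaMalloc((void **) &d_weights[1], layer_bytes));
    CUDA_CHECK(cudaMalloc((void **) &d_act, layer_bytes));
    CUDA_CHECK(cudaMemset(d_act, 0, layer_bytes));

    cudaStream_t copy_stream, compute_stream;
    cudaEvent_t  ready[2], done[2];  // ready[b]: upload into buffer b finished; done[b]: compute on b finished
    CUDA_CHECK(cudaStreamCreate(&copy_stream));
    CUDA_CHECK(cudaStreamCreate(&compute_stream));
    for (int b = 0; b < 2; ++b) {
        CUDA_CHECK(cudaEventCreate(&ready[b]));
        CUDA_CHECK(cudaEventCreate(&done[b]));
    }

    // Prefetch layer 0 into buffer 0.
    CUDA_CHECK(cudaMemcpyAsync(d_weights[0], h_layers[0], layer_bytes, cudaMemcpyHostToDevice, copy_stream));
    CUDA_CHECK(cudaEventRecord(ready[0], copy_stream));

    const int n_threads = 256;
    const int n_blocks  = (int) ((layer_elems + n_threads - 1) / n_threads);

    for (int l = 0; l < n_layers; ++l) {
        const int cur = l % 2, nxt = (l + 1) % 2;

        if (l + 1 < n_layers) {
            // Don't overwrite buffer `nxt` until the compute that last used it has finished.
            CUDA_CHECK(cudaStreamWaitEvent(copy_stream, done[nxt], 0));
            CUDA_CHECK(cudaMemcpyAsync(d_weights[nxt], h_layers[l + 1], layer_bytes,
                                       cudaMemcpyHostToDevice, copy_stream));
            CUDA_CHECK(cudaEventRecord(ready[nxt], copy_stream));
        }

        // Compute on the current layer as soon as its upload has landed.
        CUDA_CHECK(cudaStreamWaitEvent(compute_stream, ready[cur], 0));
        fake_layer_kernel<<<n_blocks, n_threads, 0, compute_stream>>>(d_weights[cur], d_act, (int) layer_elems);
        CUDA_CHECK(cudaEventRecord(done[cur], compute_stream));
    }

    CUDA_CHECK(cudaDeviceSynchronize());
    printf("streamed %d layers through 2 x %zu MiB of VRAM\n", n_layers, layer_bytes >> 20);
    // (Cleanup of buffers, streams and events omitted for brevity.)
    return 0;
}
```

With this pattern, token generation becomes limited by whichever is slower per layer: the PCIe upload or the GPU compute, instead of falling back to the CPU for the layers that do not fit in VRAM.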