Force All Computations to Run on GPU during Partial Offloading #11442
Replies: 2 comments
-
This will also significantly speed up token generation when …
-
Hmm, I think this is already possible when trying to offload everything onto a single GPU without enough VRAM. It is not intuitive right now, but it can be done: look at the llama.cpp terminal output to find the number of layers in the model, then run llama.cpp with the -ngl argument set to that layer count, with the unified memory environment variable set just before the llama.cpp command on Linux. It works, but on a mobile RTX 3060 it is slower than hybrid inference because the GPU has to wait for the CPU to swap memory from RAM into VRAM. Then again, that GPU is only connected over PCIe 3.0 x8, so that might be a big bottleneck.
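A minimal sketch of the invocation described above, assuming a Linux machine with the CUDA backend; the model path is a placeholder, and GGML_CUDA_ENABLE_UNIFIED_MEMORY is the unified-memory environment variable documented in llama.cpp's CUDA build notes:

```sh
# Placeholder model path. Setting -ngl to at least the layer count reported in
# the load logs offloads every layer; the env var lets CUDA spill to system RAM
# instead of failing when VRAM runs out (Linux only).
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./llama-cli -m ./models/model.gguf -ngl 99 -p "Hello"
```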
-
I propose adding a command-line argument to force all computations onto the GPU during partial offloading, with CPU RAM serving as an offload buffer. This would allow us to maintain high performance and efficiency even with models that are too large for the available VRAM.
Motivation
High Bandwidth PCIe 5.0: The upcoming Nvidia and AMD consumer graphics cards support PCIe 5.0 x16, offering up to 64 GB/s of bandwidth. The increased bandwidth allows for faster data transfer between the CPU and GPU, which can mitigate some of the overhead of transferring model layers across the PCIe bus during token generation. Additional PCIe slots further increase the aggregate bandwidth, and systems on older PCIe generations could benefit significantly as well.
Speculative Decoding: Speculative decoding benefits significantly from the additional compute provided by the GPU, because the larger target model evaluates the draft tokens in a batched fashion, which is compute bound and therefore ineffective on the CPU.
Mixture-of-Experts Models + Multi-Token Prediction: The growing popularity of mixture-of-experts models and multi-token prediction methods, thanks to the new releases from DeepSeek, suggests the potential for much higher throughput even with very large parameter models.
With all of these advancements combined, it is conceivable that we could get usable tokens/second with very high parameter models by using partial offloading between GPU VRAM and CPU RAM, if this enhancement is made.
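As a rough, illustrative upper bound (assumptions: a MoE model with about 37B active parameters at roughly 4-bit quantization, all active weights crossing the bus once per step, and no overlap or caching):

$$
32\,\text{GT/s} \times 16\,\text{lanes} \times \tfrac{128}{130} \div 8 \approx 63\,\text{GB/s},
\qquad
\frac{\approx 18.5\,\text{GB of active weights}}{63\,\text{GB/s}} \approx 0.3\,\text{s/token}
$$

That is only around 3 tokens/s if every token pays the full transfer, but speculative decoding and multi-token prediction let a single weight transfer serve several accepted tokens, which is where this combination becomes interesting.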
Implementation
This is partly inspired by the UMbreLLa project, which has shown that something very similar is feasible:
https://github.com/Infini-AI-Lab/UMbreLLa
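To make the idea concrete, here is a minimal, self-contained CUDA sketch of the general pattern: double-buffered streaming of per-layer weights from pinned host RAM, overlapping the upload of layer i+1 with the compute of layer i. This is not llama.cpp code and not UMbreLLa's implementation; the kernel, layer sizes, and layer count are placeholder assumptions, and only the synchronization pattern is the point.

```cuda
// Illustrative only: double-buffered weight streaming, not llama.cpp internals.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

#define CUDA_CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "CUDA error %s at %s:%d\n", cudaGetErrorString(e_), __FILE__, __LINE__); \
    return 1; } } while (0)

// Stand-in for a real transformer layer (matmuls, attention, ...).
__global__ void fake_layer_kernel(const float * weights, float * activations, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) activations[i] += 0.001f * weights[i];
}

int main() {
    const int    n_layers    = 32;        // assumption: model depth
    const size_t layer_elems = 1u << 20;  // assumption: weights per layer
    const size_t layer_bytes = layer_elems * sizeof(float);

    // Per-layer weights live in pinned host RAM so cudaMemcpyAsync can overlap with compute.
    std::vector<float *> h_layers(n_layers);
    for (int l = 0; l < n_layers; ++l) {
        CUDA_CHECK(cudaMallocHost((void **) &h_layers[l], layer_bytes));
        for (size_t i = 0; i < layer_elems; ++i) h_layers[l][i] = 0.5f;
    }

    // Two device buffers: one is computed on while the other receives the next layer.
    float * d_weights[2];
    float * d_act;
    CUDA_CHECK(cudaMalloc((void **) &d_weights[0], layer_bytes));
    CUDA_CHECK(cudaMalloc((void **) &d_weights[1], layer_bytes));
    CUDA_CHECK(cudaMalloc((void **) &d_act, layer_bytes));
    CUDA_CHECK(cudaMemset(d_act, 0, layer_bytes));

    cudaStream_t copy_stream, compute_stream;
    cudaEvent_t  ready[2], done[2];  // ready[b]: upload into buffer b finished; done[b]: compute on b finished
    CUDA_CHECK(cudaStreamCreate(&copy_stream));
    CUDA_CHECK(cudaStreamCreate(&compute_stream));
    for (int b = 0; b < 2; ++b) {
        CUDA_CHECK(cudaEventCreate(&ready[b]));
        CUDA_CHECK(cudaEventCreate(&done[b]));
    }

    // Prefetch layer 0 into buffer 0.
    CUDA_CHECK(cudaMemcpyAsync(d_weights[0], h_layers[0], layer_bytes, cudaMemcpyHostToDevice, copy_stream));
    CUDA_CHECK(cudaEventRecord(ready[0], copy_stream));

    const int n_threads = 256;
    const int n_blocks  = (int) ((layer_elems + n_threads - 1) / n_threads);

    for (int l = 0; l < n_layers; ++l) {
        const int cur = l % 2, nxt = (l + 1) % 2;

        if (l + 1 < n_layers) {
            // Don't overwrite buffer `nxt` until the compute that last used it has finished.
            CUDA_CHECK(cudaStreamWaitEvent(copy_stream, done[nxt], 0));
            CUDA_CHECK(cudaMemcpyAsync(d_weights[nxt], h_layers[l + 1], layer_bytes,
                                       cudaMemcpyHostToDevice, copy_stream));
            CUDA_CHECK(cudaEventRecord(ready[nxt], copy_stream));
        }

        // Compute on the current layer as soon as its upload has landed.
        CUDA_CHECK(cudaStreamWaitEvent(compute_stream, ready[cur], 0));
        fake_layer_kernel<<<n_blocks, n_threads, 0, compute_stream>>>(d_weights[cur], d_act, (int) layer_elems);
        CUDA_CHECK(cudaEventRecord(done[cur], compute_stream));
    }

    CUDA_CHECK(cudaDeviceSynchronize());
    printf("streamed %d layers through 2 x %zu MiB of VRAM\n", n_layers, layer_bytes >> 20);
    // (Cleanup of buffers, streams and events omitted for brevity.)
    return 0;
}
```

With this pattern, token generation becomes limited by whichever is slower per layer: the PCIe upload or the GPU compute, instead of falling back to the CPU for the layers that do not fit in VRAM.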