What techniques exist for running a large language model (LLM, 20GB+) on a resource-constrained GPU (8GB)? #6124
Unanswered
BecauseTheWorldIsRound asked this question in Q&A

How can I use a large language model (LLM, 20GB+) for inference on a machine with a smaller GPU (8GB)? Are there ways to break the computation down for efficient processing? Thank you.

Replies: 1 comment 2 replies
Well, you are in the right place. llama.cpp makes this possible with partial offloading: only as many of the model's layers as fit in VRAM are placed on the GPU, and the remaining layers run on the CPU from system RAM.
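To make that concrete, here is a minimal sketch using the llama-cpp-python bindings; the GGUF file path is hypothetical, and the right `n_gpu_layers` value depends on the model's layer count and how much of the 8GB of VRAM is actually free:

```python
# Minimal sketch: partial GPU offloading with llama-cpp-python.
# Assumptions: a quantized GGUF model exists at the (hypothetical) path below
# and llama-cpp-python was built with GPU support (e.g. CUDA).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",  # hypothetical path to a GGUF model
    n_gpu_layers=20,  # offload only this many layers to the 8GB GPU;
                      # the remaining layers stay on the CPU in system RAM
    n_ctx=2048,       # context length; larger contexts also cost memory
)

out = llm("Explain partial offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

The same knob is exposed by the llama.cpp command-line tools as `-ngl` / `--n-gpu-layers`: start with a small layer count, watch VRAM usage, and raise it until the GPU is as full as it can safely be.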