What techniques exist for running a large language model (LLM, 20GB+) on a resource-constrained GPU (8GB)? #6124
Unanswered
BecauseTheWorldIsRound asked this question in Q&A

How can I use a large language model (LLM, 20GB+) for inference on a machine with a smaller GPU (8GB)? Are there ways to break the computation down for efficient processing? Thank you.

Replies: 1 comment 2 replies
Well, you are in the right place. llama.cpp makes this possible with partial offloading: only as many of the model's layers as fit in VRAM are placed on the GPU, and the remaining layers run on the CPU from system RAM.
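To make that concrete, here is a minimal sketch using the llama-cpp-python bindings; the GGUF file path is hypothetical, and the right `n_gpu_layers` value depends on the model's layer count and how much of the 8GB of VRAM is actually free:

```python
# Minimal sketch: partial GPU offloading with llama-cpp-python.
# Assumptions: a quantized GGUF model exists at the (hypothetical) path below
# and llama-cpp-python was built with GPU support (e.g. CUDA).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",  # hypothetical path to a GGUF model
    n_gpu_layers=20,  # offload only this many layers to the 8GB GPU;
                      # the remaining layers stay on the CPU in system RAM
    n_ctx=2048,       # context length; larger contexts also cost memory
)

out = llm("Explain partial offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

The same knob is exposed by the llama.cpp command-line tools as `-ngl` / `--n-gpu-layers`: start with a small layer count, watch VRAM usage, and raise it until the GPU is as full as it can safely be.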