Is quantization performed locally? #6583
-
Hello,

```python
pn = PromptNode(
    model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
    model_kwargs={"stream": True, "device": None, "load_in_4bit": True, "device_map": "auto"},
    max_length=256,
)
```

I have a question regarding the `load_in_4bit` flag. When it is enabled, are the model files first downloaded to the local machine and then quantized locally, or is a pre-quantized model configuration downloaded from Hugging Face automatically? Thank you in advance for any clarification.
-
Hello, @ciliamadani...

`load_in_4bit` is a flag that makes the Hugging Face model run with 4-bit quantization via the bitsandbytes library. The model is downloaded at its full size and then quantized locally when it is loaded into memory, so yes, the quantization itself happens on your machine; no pre-quantized checkpoint is fetched from Hugging Face.
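To make the mechanics concrete, here is a minimal sketch of what that flag roughly translates to when using `transformers` directly (an illustration of the general mechanism, not Haystack's exact internals; it assumes `transformers`, `accelerate`, and `bitsandbytes` are installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize the weights to 4 bits at load time
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16
)

# The full-precision checkpoint is downloaded (or read from the local cache),
# and each weight tensor is quantized to 4 bits as it is placed on the device.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
```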
There are other options that download and run an already pre-quantized model (not all of them are supported in Haystack). For a complete overview, read this good blog post.
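By contrast, a pre-quantized checkpoint ships with the quantized weights, so only the small quantized files are downloaded and no local quantization pass runs. As a hedged sketch, loading a community GPTQ conversion with `transformers` might look like this (the repo id below is an example, not something from this thread; verify it on the Hub, and note that GPTQ support requires the optional `optimum` and `auto-gptq` packages):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The quantization config is read from the repo itself, so the weights
# arrive already quantized -- no local quantization step takes place.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",  # example community conversion; check the Hub
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")
```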