Performance w/ langchain? #1822
-
I think this has nothing to do with llama.cpp.
-
There are two wrappers available for loading llama.cpp and GGML models from Python, and both support GPU offload:

- llama-cpp-python only supports models that llama.cpp supports.
- ctransformers supports those, plus all the models supported by the separate ggml library (MPT, StarCoder, Replit, GPT-J, GPT-NeoX, and others).

ctransformers is designed to be as close as possible to a drop-in replacement for Hugging Face transformers, and is compatible with LlamaTokenizer, so you might want to start with that.
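For reference, a minimal sketch of GPU-offloaded loading with both wrappers. The model paths and layer counts below are placeholders, not values from this thread:

```python
# llama-cpp-python: loads models in llama.cpp's own (GGML) format
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-13b.ggmlv3.q4_0.bin",  # hypothetical path
    n_gpu_layers=40,   # number of layers to offload to the GPU
    n_ctx=2048,        # context window size
)
out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])

# ctransformers: same idea, plus support for MPT, StarCoder, GPT-J, etc.
from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./models/llama-13b.ggmlv3.q4_0.bin",  # hypothetical path
    model_type="llama",
    gpu_layers=40,     # analogous GPU offload setting
)
print(model("Q: What is the capital of France? A:", max_new_tokens=32))
```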
-
Note: If you forget `--n_gpu_layers 1`, then the CPU will be used.
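A rough sketch of what that looks like through the LangChain `LlamaCpp` wrapper, assuming a local GGML model file (the path below is a placeholder):

```python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/llama-13b.ggmlv3.q4_0.bin",  # hypothetical path
    n_gpu_layers=1,   # without this, inference falls back to the CPU
    n_batch=512,
    n_ctx=2048,
    f16_kv=True,      # keep the key/value cache in half precision
)
print(llm("Q: Name the planets in the solar system. A:"))
```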
-
Am I doing this correctly with langchain? In particular, am I using the optimized CPP version of llama, or the python version?
I'm using the 13B version on a souped-up M2 Pro and it is sloooooow. As in, it takes about one minute to make a simple query.
It uses `low_cpu_mem_usage` with the offline option.
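For context, `low_cpu_mem_usage` is an argument to Hugging Face transformers' `from_pretrained`, which suggests the pure PyTorch path rather than llama.cpp. A sketch of the two setups, with placeholder model paths, to make the distinction concrete:

```python
# 1) Hugging Face transformers via LangChain (the "python version"):
#    low_cpu_mem_usage only reduces peak RAM while loading; inference still
#    runs in PyTorch and can be slow for a 13B model on CPU.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline, LlamaCpp

tokenizer = AutoTokenizer.from_pretrained("./llama-13b-hf")  # placeholder path
model = AutoModelForCausalLM.from_pretrained(
    "./llama-13b-hf", low_cpu_mem_usage=True
)
hf_llm = HuggingFacePipeline(
    pipeline=pipeline("text-generation", model=model, tokenizer=tokenizer)
)

# 2) llama.cpp via LangChain (the optimized C++ path):
cpp_llm = LlamaCpp(
    model_path="./llama-13b.ggmlv3.q4_0.bin",  # placeholder path
    n_gpu_layers=1,
)
```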