Performance w/ langchain? #1822
-
I think this has nothing to do with llama.cpp.
-
There are two wrappers available for loading llama.cpp and GGML models from Python, and both support GPU offload:

- llama-cpp-python only supports models that llama.cpp supports.
- ctransformers supports those, plus all the models supported by the separate ggml library (MPT, StarCoder, Replit, GPT-J, GPT-NeoX, and others).

ctransformers is designed to be as close as possible to a drop-in replacement for Hugging Face transformers, and is compatible with LlamaTokenizer, so you might want to start with that.
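For reference, a minimal sketch of GPU-offloaded loading with both wrappers. The model paths and layer counts below are placeholders, not values from this thread:

```python
# llama-cpp-python: loads models in llama.cpp's own (GGML) format
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-13b.ggmlv3.q4_0.bin",  # hypothetical path
    n_gpu_layers=40,   # number of layers to offload to the GPU
    n_ctx=2048,        # context window size
)
out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])

# ctransformers: same idea, plus support for MPT, StarCoder, GPT-J, etc.
from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./models/llama-13b.ggmlv3.q4_0.bin",  # hypothetical path
    model_type="llama",
    gpu_layers=40,     # analogous GPU offload setting
)
print(model("Q: What is the capital of France? A:", max_new_tokens=32))
```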
-
Note: If you forget `--n_gpu_layers 1`, then the CPU will be used.
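A rough sketch of what that looks like through the LangChain `LlamaCpp` wrapper, assuming a local GGML model file (the path below is a placeholder):

```python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/llama-13b.ggmlv3.q4_0.bin",  # hypothetical path
    n_gpu_layers=1,   # without this, inference falls back to the CPU
    n_batch=512,
    n_ctx=2048,
    f16_kv=True,      # keep the key/value cache in half precision
)
print(llm("Q: Name the planets in the solar system. A:"))
```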
-
Am I doing this correctly with langchain? In particular, am I using the optimized CPP version of llama, or the python version?
I'm using the 13B version on a souped-up M2 Pro and it is sloooooow. As in, it takes about one minute to make a simple query.
It uses `low_cpu_mem_usage` with the offline option.
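For context, `low_cpu_mem_usage` is an argument to Hugging Face transformers' `from_pretrained`, which suggests the pure PyTorch path rather than llama.cpp. A sketch of the two setups, with placeholder model paths, to make the distinction concrete:

```python
# 1) Hugging Face transformers via LangChain (the "python version"):
#    low_cpu_mem_usage only reduces peak RAM while loading; inference still
#    runs in PyTorch and can be slow for a 13B model on CPU.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline, LlamaCpp

tokenizer = AutoTokenizer.from_pretrained("./llama-13b-hf")  # placeholder path
model = AutoModelForCausalLM.from_pretrained(
    "./llama-13b-hf", low_cpu_mem_usage=True
)
hf_llm = HuggingFacePipeline(
    pipeline=pipeline("text-generation", model=model, tokenizer=tokenizer)
)

# 2) llama.cpp via LangChain (the optimized C++ path):
cpp_llm = LlamaCpp(
    model_path="./llama-13b.ggmlv3.q4_0.bin",  # placeholder path
    n_gpu_layers=1,
)
```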