Getting started. #2597
-
Hey! My project currently uses GPT-3.5 via API calls (which ends up being pretty expensive), and after testing the demo for llama2-70b-chat, it looks like it would work well enough for at least some of the prompts I'm using. But things seem to be moving fast, and I'm not clear on exactly what is possible and what the best path forward is. I have:
I can get if needed:
I want:
My question is: what's my best path forward? I've been able to get the 7B version running by following the instructions, but my local computer doesn't have enough RAM to run the 70B version (hence the mention of a server above, which I'm getting ready to start renting). I'm not sure exactly how to run on the GPU, or whether I have enough RAM to do that; I don't think I fully follow the instructions for that part. I'm also not sure what my options are besides llama.cpp, if any. I found this project, but I don't know whether there are others out there I could try that would be a better fit for what I'm doing. Any help/pointers would be very much appreciated. Cheers.
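In case it helps with the GPU question: the sketch below shows the general shape of GPU offloading in llama.cpp from the ggmlv3 era. The build flag, the layer count, and the model path are assumptions (the filename is the one recommended further down this thread), so treat it as a starting point rather than a recipe.

```bash
# Build llama.cpp with its CUDA (cuBLAS) backend -- flag name as of mid-2023;
# check the README for your checkout.
make clean && LLAMA_CUBLAS=1 make

# -m    quantized model file (name taken from this thread)
# -gqa  8 was required for the 70B GGML models (grouped-query attention)
# -ngl  how many layers to offload to the GPU -- tune to your VRAM;
#       everything not offloaded stays in system RAM
# -c    context size, -t CPU threads for the non-offloaded layers
./main -m ./models/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -ngl 40 -c 4096 -t 8 \
       -p "Hello"
```

The point of `-ngl` is that only as many layers as fit need to go to the GPU; the rest keep running on the CPU from system RAM.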
-
These "Getting started" instructions no longer seem to be accurate; at least, they don't work from a Mac Terminal. I tried this command:
I started another discussion here for beginners, hoping that it will help me (and others down the line): #10631
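For anyone else starting from scratch, the basic macOS flow at the time looked roughly like the sketch below; the model filename is a placeholder and the official README is the authoritative source.

```bash
# Rough sketch of the usual macOS steps (plain CPU build; Metal builds used
# LLAMA_METAL=1 make at the time -- check the current README).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Run a small chat model on the CPU. The .bin below is a placeholder --
# drop whichever quantized model you downloaded into ./models first.
./main -m ./models/llama-2-7b-chat.ggmlv3.q4_0.bin -c 2048 -n 256 -p "Hello"
```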
-
@arthurwolf,
llama.cpp can definitely do the job! E.g.: "I'm successfully running llama-2-70b-chat.ggmlv3.q3_K_S on my 32 GB of RAM on CPU, at a speed of 1.2 tokens/s without any GPU offloading (I don't have a discrete GPU), using the full 4k context and kobold.cpp on Windows 11 Pro." (mentioned here)
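For comparison, a CPU-only llama.cpp run equivalent to that kobold.cpp setup would look roughly like this; the filename and thread count are assumptions.

```bash
# Hedged sketch: 70B q3_K_S entirely on the CPU, full 4k context, no offload.
# -t should roughly match your physical core count; expect ~1-2 tokens/s.
./main -m ./models/llama-2-70b-chat.ggmlv3.q3_K_S.bin -gqa 8 -c 4096 -t 8 \
       -p "Hello"
```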
Were it me, I'd start my experiments with the 38.8 GB llama-2-70b-chat.ggmlv3.q4_0.bin.
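If it helps, pulling that file down is a one-liner; the repo path below follows TheBloke's usual Hugging Face naming and is an assumption, so verify it before starting a ~39 GB download.

```bash
# Assumed repo layout -- double-check the exact path on Hugging Face first.
wget -c -P ./models/ \
  https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/resolve/main/llama-2-70b-chat.ggmlv3.q4_0.bin
```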
We know a GPU is significantly faster than a CPU. I use RunPod for cloud GPUs. An A6000 has 48 GB of GPU RAM, so it could run the above Llama-2-70B-Chat model entirely on the GPU. RunPod has a template, also by TheBloke, that is a good starting point; doco here.
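On a card like that, the whole q4_0 file fits in VRAM, so every layer can be offloaded; a rough sketch, with the same caveats as above.

```bash
# Hedged sketch: cuBLAS build, all layers on the GPU. LLaMA-2 70B has 80
# transformer layers, so any -ngl value at or above that offloads everything.
./main -m ./models/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -ngl 100 -c 4096 \
       -p "Hello"
```

llama.cpp also ships a server example that exposes the loaded model over HTTP, which may be closer to the API-call workflow described in the original post.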
There seems to be far more discussion of the MLOps side of things (model selection, hardware specs, setups, etc.) on Reddit and Hugging Face than here. TheBloke has a Discord channel as well; personally, I'd start there.