Getting started. #2597
-
Hey! My project currently uses GPT-3.5 via API calls (which ends up being pretty expensive), and after testing the demo for llama2-70b-chat, it looks like it would work well enough for at least some of the prompts I'm using. But things seem to be moving fast, and I'm not clear on exactly what is possible and what the best path forward is. I have:
I can get if needed:
I want:
My question is: what's my best path forward? I've been able to get the 7B version running by following the instructions, but my local computer doesn't have enough RAM to run the 70B version (hence the mention of a server above, which I'm getting ready to start renting). I'm not sure exactly how to run on the GPU, or whether I have enough RAM to do that; I don't think I fully follow the instructions for that part. I'm also not sure what my options are besides llama.cpp, if any. I found this project, but I don't know whether there are others out there I could try that would be a better fit for what I'm doing. Any help/pointers would be very much appreciated. Cheers.
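In case it helps with the GPU question: the sketch below shows the general shape of GPU offloading in llama.cpp from the ggmlv3 era. The build flag, the layer count, and the model path are assumptions (the filename is the one recommended further down this thread), so treat it as a starting point rather than a recipe.

```bash
# Build llama.cpp with its CUDA (cuBLAS) backend -- flag name as of mid-2023;
# check the README for your checkout.
make clean && LLAMA_CUBLAS=1 make

# -m    quantized model file (name taken from this thread)
# -gqa  8 was required for the 70B GGML models (grouped-query attention)
# -ngl  how many layers to offload to the GPU -- tune to your VRAM;
#       everything not offloaded stays in system RAM
# -c    context size, -t CPU threads for the non-offloaded layers
./main -m ./models/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -ngl 40 -c 4096 -t 8 \
       -p "Hello"
```

The point of `-ngl` is that only as many layers as fit need to go to the GPU; the rest keep running on the CPU from system RAM.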
-
These "Getting started" instructions no longer seem to be accurate; at least, they don't work from a Mac Terminal. I tried this command:
I started another discussion here for beginners, hoping that it will help me (and others down the line): #10631
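For anyone else starting from scratch, the basic macOS flow at the time looked roughly like the sketch below; the model filename is a placeholder and the official README is the authoritative source.

```bash
# Rough sketch of the usual macOS steps (plain CPU build; Metal builds used
# LLAMA_METAL=1 make at the time -- check the current README).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Run a small chat model on the CPU. The .bin below is a placeholder --
# drop whichever quantized model you downloaded into ./models first.
./main -m ./models/llama-2-7b-chat.ggmlv3.q4_0.bin -c 2048 -n 256 -p "Hello"
```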
-
@arthurwolf,
llama.cpp can definitely do the job! E.g.: "I'm successfully running llama-2-70b-chat.ggmlv3.q3_K_S on my 32 GB of RAM on CPU, at a speed of 1.2 tokens/s without any GPU offloading (I don't have a discrete GPU), using the full 4k context and kobold.cpp on Windows 11 Pro." (mentioned here)
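For comparison, a CPU-only llama.cpp run equivalent to that kobold.cpp setup would look roughly like this; the filename and thread count are assumptions.

```bash
# Hedged sketch: 70B q3_K_S entirely on the CPU, full 4k context, no offload.
# -t should roughly match your physical core count; expect ~1-2 tokens/s.
./main -m ./models/llama-2-70b-chat.ggmlv3.q3_K_S.bin -gqa 8 -c 4096 -t 8 \
       -p "Hello"
```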
Were it me, I'd start my experiments with the 38.8 GB llama-2-70b-chat.ggmlv3.q4_0.bin.
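If it helps, pulling that file down is a one-liner; the repo path below follows TheBloke's usual Hugging Face naming and is an assumption, so verify it before starting a ~39 GB download.

```bash
# Assumed repo layout -- double-check the exact path on Hugging Face first.
wget -c -P ./models/ \
  https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/resolve/main/llama-2-70b-chat.ggmlv3.q4_0.bin
```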
We know a GPU is significantly faster than a CPU. I use RunPod for cloud GPUs. An A6000 has 48 GB of GPU RAM, so it could run the above Llama-2-70B-Chat model entirely on the GPU. RunPod has a template, also by TheBloke, that is a good starting point; doco here.
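On a card like that, the whole q4_0 file fits in VRAM, so every layer can be offloaded; a rough sketch, with the same caveats as above.

```bash
# Hedged sketch: cuBLAS build, all layers on the GPU. LLaMA-2 70B has 80
# transformer layers, so any -ngl value at or above that offloads everything.
./main -m ./models/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -ngl 100 -c 4096 \
       -p "Hello"
```

llama.cpp also ships a server example that exposes the loaded model over HTTP, which may be closer to the API-call workflow described in the original post.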
There seems to be far more discussion of the MLOps side of things (model selection, hardware specs, setups, etc.) on Reddit and Hugging Face than here. TheBloke has a Discord channel as well; personally, I'd start there.