Very promising! #22
evansumarosenberg started this conversation in General
-
Nice, so it works under WSL? Same speeds? I was able to push it to 19 tokens/sec on an Ada A6000 on RunPod, splitting the model to use my 4090 with the older A6000.
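A minimal sketch of that kind of two-GPU split, using the `-gs`/`--gpu_split` flag from the exllama scripts (the per-GPU VRAM figures and model path below are illustrative assumptions, not the poster's actual values):

```sh
# -gs takes a comma-separated list of per-device allocations in GB:
# here ~16 GB on GPU 0 (the 4090) and ~24 GB on GPU 1 (the A6000).
python test_benchmark_inference.py -d /path/to/model -p -gs 16,24
```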
-
Fantastic work! I just started using exllama and the performance is very impressive. Here are some benchmarks from my initial testing today using the included benchmarking script (128 tokens generated from a 1,920-token prompt).
Models tested: TheBloke_guanaco-33B-GPTQ and TheBloke_guanaco-65B-GPTQ
The 4090 was on my local machine with a Core i9-12900K on Windows 11 (WSL). The A100 and A40 benchmarks were run on an HPC cluster using a single compute node with 32 CPU cores. This is an order of magnitude increase in performance compared to using text-generation-webui and GPTQ-for-LLaMA. For comparison, with the 33B model, I was previously getting 8-10 tokens/sec on the 4090 and 2.5-3 tokens/sec on the A100.
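For reference, the bundled script can be invoked roughly like this, per the exllama README (the model directory path is a placeholder):

```sh
# -d points at the directory holding the GPTQ weights and tokenizer;
# -p runs the speed benchmark, -ppl additionally measures perplexity.
python test_benchmark_inference.py -d /path/to/TheBloke_guanaco-33B-GPTQ -p -ppl
```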
I saw in your updates that you are working on a web UI, which is fantastic. I am more interested in a web API, so I will probably go ahead and implement a quick and dirty server if you're not already close to finishing one. Happy to share the code for that if you're interested.
By the way, I had to figure out a few extra steps to get things running in conda; the same setup worked for me in both WSL and Linux.
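A minimal sketch of such a conda setup, assuming Python 3.10, the cu118 PyTorch wheel index, and the turboderp/exllama repo (adjust versions to your driver; these are not necessarily the poster's exact steps):

```sh
# Fresh environment with a known-good Python
conda create -n exllama python=3.10 -y
conda activate exllama

# CUDA-enabled PyTorch; the cu118 wheel index is an assumption, match it to your setup
pip install torch --index-url https://download.pytorch.org/whl/cu118

# Clone the repo and install its remaining dependencies
git clone https://github.com/turboderp/exllama
cd exllama
pip install -r requirements.txt
```

Note that exllama compiles its CUDA extension on first run via PyTorch's JIT extension loader, so `nvcc` and a C++ compiler need to be visible inside the environment.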