Here's my batching/caching API I made over the weekend. 200+ tk/s with Mistral 5.0bpw.exl2 on an RTX 3090 with concurrent requests. It was for a personal project, and it's not complete, but it's very fast. #247
epolewski started this conversation in Show and tell
Replies: 1 comment
-
Very cool!
-
Seems like the kind of crowd who'd enjoy some open source code showing how to implement a batching API with exllamav2:
https://github.com/epolewski/EricLLM
I made it mostly as a drop-in replacement for vLLM while they fix a bug that I can't work around or find a solution to.
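For anyone curious what request batching looks like in general, here is a minimal sketch of the idea. It is not the code from the EricLLM repo and it does not call exllamav2 at all: `fake_generate_batch`, `MicroBatcher`, `max_batch`, and `max_wait_s` are hypothetical names made up for illustration, and the "model" just echoes the prompt. The only point is to show concurrent requests being collected for a short window and then served in one batched call.

```python
import asyncio


def fake_generate_batch(prompts):
    # Hypothetical stand-in for a batched model call (NOT an exllamav2 API).
    # A real backend would run one forward pass over all prompts here.
    return [f"echo: {p}" for p in prompts]


class MicroBatcher:
    """Collects concurrent requests and serves them in small batches."""

    def __init__(self, max_batch=8, max_wait_s=0.01):
        self.max_batch = max_batch      # assumed cap on prompts per batch
        self.max_wait_s = max_wait_s    # assumed wait window for stragglers
        self.queue = asyncio.Queue()
        self.batch_sizes = []           # recorded only for inspection

    async def submit(self, prompt):
        # Each caller gets a future that resolves when its batch completes.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            # Keep collecting until the batch is full or the window closes.
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self.queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
            outputs = fake_generate_batch([p for p, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, fut), out in zip(batch, outputs):
                if not fut.done():
                    fut.set_result(out)


async def demo():
    batcher = MicroBatcher(max_batch=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    # Six concurrent requests; results come back in submission order.
    results = await asyncio.gather(
        *(batcher.submit(f"prompt {i}") for i in range(6))
    )
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    results, sizes = asyncio.run(demo())
    print(results)
    print("batch sizes:", sizes)
```

The wait window is the usual trade-off: a longer `max_wait_s` tends to give fuller batches and better throughput, at the cost of a little extra latency for the first request in each batch.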