Best way to deploy an SLM/LLM model from Hugging Face. Need best approach. #12657
AakashNakarmi announced in Q&A
I want to deploy an SLM/LLM model from Hugging Face with the lowest possible response time.
Please help me with the best library and approach for inference. If there is a template for production-level inference, that would be helpful.
I want to use or build an image that can serve multiple requests at a time through a port.
Library: Transformers with pipeline
Current inference time: ~4-5 s for 2000 tokens
Model to deploy: Phi-3-mini-4k-instruct, ~7.5 GB
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
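For reference, here is a minimal sketch of the kind of `pipeline` setup described above. The generation parameters, dtype, and prompt are my assumptions, not details from the original measurement:

```python
# Baseline sketch: transformers pipeline, serving one request at a time.
# dtype/device settings are assumptions; adjust to your environment.
# Older transformers versions may also need trust_remote_code=True.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,  # half precision fits comfortably on a 24 GB GPU
    device_map="cuda",
)

messages = [{"role": "user", "content": "Summarize KV caching in two sentences."}]
out = pipe(messages, max_new_tokens=256, do_sample=False)
print(out[0]["generated_text"])
```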
System Configuration:
NVIDIA 24 GB GPU
400 GB RAM
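One possible way to get the concurrent, port-based serving described above (a suggestion on my part, not something from the post) is a continuous-batching engine such as vLLM. A minimal sketch of its offline Python API, assuming vLLM is installed and reusing the model name above:

```python
# Sketch of batched inference with vLLM (an assumed alternative to the
# transformers pipeline; continuous batching overlaps many requests on the GPU).
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3-mini-4k-instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.0, max_tokens=256)

# These prompts are batched together rather than run serially.
prompts = [
    "Explain continuous batching in one paragraph.",
    "What is paged attention?",
]
for result in llm.generate(prompts, params):
    print(result.outputs[0].text)
```

The same engine ships an OpenAI-compatible HTTP server and an official Docker image, so something like `docker run --gpus all -p 8000:8000 vllm/vllm-openai --model microsoft/Phi-3-mini-4k-instruct` (flags adjusted to your setup) would expose the model on a port and handle multiple requests concurrently.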