Best way to deploy an SLM/LLM model from Hugging Face. Need best approach. #12657
AakashNakarmi announced in Q&A
I want to deploy an SLM/LLM model from Hugging Face with the lowest possible response time.
Please help me with the best library and approach for inference. If there is a template for production-level inference, that would be helpful.
I want to use or build an image that can serve multiple requests at a time through a port.
Library: Transformers with pipeline
Current inference time: ~4-5 s for 2000 tokens
Model to deploy: Phi-3-mini-4k-instruct, ~7.5 GB
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
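For reference, here is a minimal sketch of the kind of `pipeline` setup described above. The generation parameters, dtype, and prompt are my assumptions, not details from the original measurement:

```python
# Baseline sketch: transformers pipeline, serving one request at a time.
# dtype/device settings are assumptions; adjust to your environment.
# Older transformers versions may also need trust_remote_code=True.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,  # half precision fits comfortably on a 24 GB GPU
    device_map="cuda",
)

messages = [{"role": "user", "content": "Summarize KV caching in two sentences."}]
out = pipe(messages, max_new_tokens=256, do_sample=False)
print(out[0]["generated_text"])
```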
System Configuration:
NVIDIA 24 GB GPU
400 GB RAM
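One possible way to get the concurrent, port-based serving described above (a suggestion on my part, not something from the post) is a continuous-batching engine such as vLLM. A minimal sketch of its offline Python API, assuming vLLM is installed and reusing the model name above:

```python
# Sketch of batched inference with vLLM (an assumed alternative to the
# transformers pipeline; continuous batching overlaps many requests on the GPU).
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3-mini-4k-instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.0, max_tokens=256)

# These prompts are batched together rather than run serially.
prompts = [
    "Explain continuous batching in one paragraph.",
    "What is paged attention?",
]
for result in llm.generate(prompts, params):
    print(result.outputs[0].text)
```

The same engine ships an OpenAI-compatible HTTP server and an official Docker image, so something like `docker run --gpus all -p 8000:8000 vllm/vllm-openai --model microsoft/Phi-3-mini-4k-instruct` (flags adjusted to your setup) would expose the model on a port and handle multiple requests concurrently.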