vino9net/vllm-server


Run the test program

Create a .env file with your Hugging Face token:

HF_TOKEN="hf_xxx"
HUGGINGFACE_HUB_TOKEN="hf_xxx"
# the following will download the model
# and run inference

uv sync
source .venv/bin/activate
python main.py
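main.py itself is not shown above. One step it depends on is getting the .env values into the process environment; a minimal, hypothetical loader for the simple KEY="value" format shown above could look like this (a real project might use python-dotenv instead):

```python
import os


def load_dotenv(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY="value" lines and export them into os.environ."""
    loaded = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # skip blank lines, comments, and anything without '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip().strip('"')
    os.environ.update(loaded)
    return loaded
```

With HF_TOKEN and HUGGINGFACE_HUB_TOKEN exported, the Hugging Face client libraries can authenticate when downloading gated models.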

Run the OpenAI-compatible API server

uv sync
source .venv/bin/activate

# Option 1: Llama 3.1 8B Instruct
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 4096 \
  --dtype auto \
  --api-key token-edgar-1470

# Option 2: Phi-4 mini Instruct, with chunked prefill enabled
vllm serve microsoft/Phi-4-mini-instruct \
  --gpu-memory-utilization 0.95 \
  --max-model-len 4096 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 32768 \
  --enable-chunked-prefill \
  --worker-use-ray \
  --dtype bfloat16 \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key token-abc123


# in a separate terminal
python client.py
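client.py is not shown in this README. A minimal, stdlib-only sketch of what it might do, calling the server's OpenAI-compatible /v1/chat/completions endpoint, could look like this (the base URL, API key, and model name match the Phi-4 command above; the function names and prompt are illustrative assumptions):

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Assemble an OpenAI-style chat completion request: (url, headers, body)."""
    url = f"{base_url}/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return url, headers, body


def chat(base_url: str, api_key: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    url, headers, body = build_chat_request(base_url, api_key, model, prompt)
    req = urllib.request.Request(url, data=json.dumps(body).encode(), headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("http://localhost:8000", "token-abc123",
               "microsoft/Phi-4-mini-instruct", "Say hello in one sentence."))
```

The same endpoint also works with the official openai Python SDK by pointing base_url at http://localhost:8000/v1 and passing the --api-key value as the API key.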

The examples above allow 4096 tokens of combined input and output (roughly 1 token = 0.75 words).
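As a quick sanity check on that budget, using the rough 0.75 words-per-token figure (the function name and heuristic are illustrative, not part of the project):

```python
def estimate_words(max_model_len: int, words_per_token: float = 0.75) -> int:
    """Rough English word budget for a given context length (heuristic only)."""
    return int(max_model_len * words_per_token)


# --max-model-len 4096 covers roughly 3072 words of input + output combined
print(estimate_words(4096))  # → 3072
```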

Hardware

Tested on an AWS g6.2xlarge instance with an NVIDIA L4 GPU (23 GB of VRAM), using the Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 24.04) 20250711.
