Deploying-Llama-3.3-70B

Deploy and serve Llama 3.3 70B with AWQ quantization using vLLM and BentoML.

Clone the repository

git clone https://github.com/kingabzpro/Deploying-Llama-3.3-70B.git
cd Deploying-Llama-3.3-70B

Deployment

  1. Install the dependencies:
pip install -r requirements.txt
  2. Log in to BentoCloud:
bentoml cloud login
  3. Deploy the model:
bentoml deploy .

Inference

The deployed model can be accessed via a cURL command, the BentoML Python client, or the OpenAI Python client. For example, with the OpenAI client:

from openai import OpenAI

# Point the OpenAI client at your BentoCloud deployment endpoint.
client = OpenAI(base_url="<BentoCloud endpoint>", api_key="<Your BentoCloud API key>")

chat_completion = client.chat.completions.create(
    model="casperhansen/llama-3.3-70b-instruct-awq",
    messages=[
        {
            "role": "user",
            "content": "What is a black hole and how does it work?"
        }
    ],
    stream=True,
    stop=["<|eot_id|>", "<|end_of_text|>"],
)

# Stream the response token by token as it arrives.
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")
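
Alternatively, the BentoML Python client can call the service directly. Below is a minimal sketch, assuming the service exposes a streaming generate endpoint that accepts prompt and max_tokens arguments; the actual endpoint name and parameters depend on the service definition in this repository:

import bentoml

# "generate" and its parameters are assumptions; check the service
# definition (e.g. service.py) for the actual endpoint signature.
with bentoml.SyncHTTPClient("<BentoCloud endpoint>") as client:
    response_generator = client.generate(
        prompt="What is a black hole and how does it work?",
        max_tokens=1024,
    )
    # The client yields generated text chunks as they stream back.
    for chunk in response_generator:
        print(chunk, end="")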
