<div align="center">
<h1>🦾 OpenLLM: Self-Hosting LLMs Made Easy</h1>
</div>

[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202-green.svg)](https://github.com/bentoml/OpenLLM/blob/main/LICENSE)
[![Releases](https://img.shields.io/pypi/v/openllm.svg?logo=pypi&label=PyPI&logoColor=gold)](https://pypi.org/project/openllm)

OpenLLM supports a wide range of state-of-the-art open-source LLMs. You can also add a [model repository to run custom models](#set-up-a-custom-repository) with OpenLLM.

| Model               | Parameters | Required GPU | Start a Server                                         |
| ------------------- | ---------- | ------------ | ------------------------------------------------------ |
| deepseek-r1         | 671B       | 80Gx16       | `openllm serve deepseek-r1:671b-fc3d`                  |
| deepseek-r1-distill | 14B        | 80G          | `openllm serve deepseek-r1-distill:qwen2.5-14b-98a9`   |
| deepseek-v3         | 671B       | 80Gx16       | `openllm serve deepseek-v3:671b-instruct-d7ec`         |
| gemma2              | 2B         | 12G          | `openllm serve gemma2:2b-instruct-747d`                |
| llama3.1            | 8B         | 24G          | `openllm serve llama3.1:8b-instruct-3c0c`              |
| llama3.2            | 1B         | 24G          | `openllm serve llama3.2:1b-instruct-f041`              |
| llama3.3            | 70B        | 80Gx2        | `openllm serve llama3.3:70b-instruct-b850`             |
| mistral             | 8B         | 24G          | `openllm serve mistral:8b-instruct-50e8`               |
| mistral-large       | 123B       | 80Gx4        | `openllm serve mistral-large:123b-instruct-1022`       |
| mistralai           | 24B        | 80G          | `openllm serve mistralai:24b-small-instruct-2501-0e69` |
| mixtral             | 8x7B       | 80Gx2        | `openllm serve mixtral:8x7b-instruct-v0.1-b752`        |
| phi4                | 14B        | 80G          | `openllm serve phi4:14b-c12d`                          |
| pixtral             | 12B        | 80G          | `openllm serve pixtral:12b-240910-c344`                |
| qwen2.5             | 7B         | 24G          | `openllm serve qwen2.5:7b-instruct-3260`               |
| qwen2.5-coder       | 7B         | 24G          | `openllm serve qwen2.5-coder:7b-instruct-e75d`         |
| qwen2.5vl           | 3B         | 24G          | `openllm serve qwen2.5vl:3b-instruct-4686`             |

...

To start an LLM server locally, use the `openllm serve` command and specify the model version.

> [!NOTE]
> OpenLLM does not store model weights. A Hugging Face token (HF_TOKEN) is required for gated models.
>
> 1. Create your Hugging Face token [here](https://huggingface.co/settings/tokens).
> 2. Request access to the gated model, such as [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).
> 3. Set your token as an environment variable by running:
> ```bash
> export HF_TOKEN=<your token>
> ```

```bash
openllm serve llama3.2:1b-instruct-f041
```

The server will be accessible at [http://localhost:3000](http://localhost:3000/), providing OpenAI-compatible APIs for interaction. You can call the endpoints with different frameworks and tools that support OpenAI-compatible APIs. Typically, you may need to specify the following:

- The API base URL: `http://localhost:3000/v1`
- The model name served by OpenLLM, for example `meta-llama/Llama-3.2-1B-Instruct`
- An API key, which can be any placeholder value for a local server, for example `na`

<details>

<summary>OpenAI Python client</summary>

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# List the models currently being served
# model_list = client.models.list()
# print(model_list)

chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old",
        }
    ],
    stream=True,
)

for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")
```

</details>

<details>

<summary>LlamaIndex</summary>

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(api_base="http://localhost:3000/v1", model="meta-llama/Llama-3.2-1B-Instruct", api_key="dummy")
...
```

</details>

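Because the server follows the standard OpenAI wire format, any plain HTTP client works as well. Below is a minimal sketch using the `requests` library, assuming the server from the example above is running locally on port 3000 and serving `meta-llama/Llama-3.2-1B-Instruct`; the `/v1/chat/completions` path and payload shape come from the OpenAI API convention rather than anything OpenLLM-specific, and the prompt is only illustrative.

```python
import requests

# Assumes a local server started with, e.g., `openllm serve llama3.2:1b-instruct-f041`.
response = requests.post(
    "http://localhost:3000/v1/chat/completions",
    headers={"Authorization": "Bearer na"},  # any placeholder key works locally
    json={
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "What does OpenLLM do?"}],
        "stream": False,  # request a single JSON response instead of a stream
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```
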
## Chat UI

## Model repository

To make sure your local list of models is synchronized with the latest model repository, run `openllm repo update`.

To review a model’s information, run:

```bash
openllm model get llama3.2:1b-instruct-f041
```

### Add a model to the default model repository

## Deploy to BentoCloud

OpenLLM supports LLM cloud deployment via BentoML, the unified model serving framework.

[Sign up for BentoCloud](https://www.bentoml.com/) for free and [log in](https://docs.bentoml.com/en/latest/bentocloud/how-tos/manage-access-token.html). Then, run `openllm deploy` to deploy a model to BentoCloud:

```bash
openllm deploy llama3.2:1b-instruct-f041
```

> [!NOTE]
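Once the deployment is up, it exposes the same OpenAI-compatible API as a local server, so the client code above carries over with only the base URL and key changed. A minimal sketch with the `openai` Python client, where the deployment URL is a hypothetical placeholder and the `BENTOCLOUD_API_KEY` environment variable is an assumption for deployments with authorization enabled:

```python
import os

from openai import OpenAI

# Hypothetical endpoint; copy the real URL from your BentoCloud console.
BASE_URL = "https://my-openllm-service.example.bentoml.ai/v1"

client = OpenAI(
    base_url=BASE_URL,
    # If the deployment is protected, supply your BentoCloud API token;
    # otherwise any placeholder value works, as with a local server.
    api_key=os.environ.get("BENTOCLOUD_API_KEY", "na"),
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Hello from BentoCloud!"}],
)
print(completion.choices[0].message.content)
```
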

## Acknowledgements

This project uses the following open-source projects:

- [bentoml/bentoml](https://github.com/bentoml/bentoml) for production level model serving
- [vllm-project/vllm](https://github.com/vllm-project/vllm) for production level LLM backend
- [blrchen/chatgpt-lite](https://github.com/blrchen/chatgpt-lite) for a fancy Web Chat UI
- [astral-sh/uv](https://github.com/astral-sh/uv) for blazing fast model requirements installing

We are grateful to the developers and contributors of these projects for their hard work and dedication.