Commit a421874

add support for mixtral-8x7b and mixtral-8x7b-instruct (#408)
* bump datadog module to 0.47.0 for ipv6 support for dogstatsd
* add mixtral-8x7b and mixtral-8x7b-instruct
* update context window
* docker update
* install megablocks
1 parent 15edb5d commit a421874

File tree

5 files changed: +41 −29 lines

docs/model_zoo.md

Lines changed: 28 additions & 26 deletions (the diff realigns the table columns; the net content change is the two new `mixtral` rows)

Scale hosts the following models in the LLM Engine Model Zoo:

| Model Name                | Inference APIs Available | Fine-tuning APIs Available | Inference Frameworks Available             | Inference max total tokens (prompt + response) |
| ------------------------- | ------------------------ | -------------------------- | ------------------------------------------ | ---------------------------------------------- |
| `llama-7b`                | ✅                       | ✅                         | deepspeed, text-generation-inference       | 2048                                            |
| `llama-2-7b`              | ✅                       | ✅                         | text-generation-inference, vllm            | 4096                                            |
| `llama-2-7b-chat`         | ✅                       |                            | text-generation-inference, vllm            | 4096                                            |
| `llama-2-13b`             | ✅                       |                            | text-generation-inference, vllm            | 4096                                            |
| `llama-2-13b-chat`        | ✅                       |                            | text-generation-inference, vllm            | 4096                                            |
| `llama-2-70b`             | ✅                       | ✅                         | text-generation-inference, vllm            | 4096                                            |
| `llama-2-70b-chat`        | ✅                       |                            | text-generation-inference, vllm            | 4096                                            |
| `falcon-7b`               | ✅                       |                            | text-generation-inference, vllm            | 2048                                            |
| `falcon-7b-instruct`      | ✅                       |                            | text-generation-inference, vllm            | 2048                                            |
| `falcon-40b`              | ✅                       |                            | text-generation-inference, vllm            | 2048                                            |
| `falcon-40b-instruct`     | ✅                       |                            | text-generation-inference, vllm            | 2048                                            |
| `mpt-7b`                  | ✅                       |                            | deepspeed, text-generation-inference, vllm | 2048                                            |
| `mpt-7b-instruct`         | ✅                       | ✅                         | deepspeed, text-generation-inference, vllm | 2048                                            |
| `flan-t5-xxl`             | ✅                       |                            | deepspeed, text-generation-inference       | 2048                                            |
| `mistral-7b`              | ✅                       | ✅                         | vllm                                       | 8000                                            |
| `mistral-7b-instruct`     | ✅                       | ✅                         | vllm                                       | 8000                                            |
| `mixtral-8x7b`            | ✅                       |                            | vllm                                       | 32768                                           |
| `mixtral-8x7b-instruct`   | ✅                       |                            | vllm                                       | 32768                                           |
| `codellama-7b`            | ✅                       | ✅                         | text-generation-inference, vllm            | 16384                                           |
| `codellama-7b-instruct`   | ✅                       | ✅                         | text-generation-inference, vllm            | 16384                                           |
| `codellama-13b`           | ✅                       | ✅                         | text-generation-inference, vllm            | 16384                                           |
| `codellama-13b-instruct`  | ✅                       | ✅                         | text-generation-inference, vllm            | 16384                                           |
| `codellama-34b`           | ✅                       | ✅                         | text-generation-inference, vllm            | 16384                                           |
| `codellama-34b-instruct`  | ✅                       | ✅                         | text-generation-inference, vllm            | 16384                                           |
| `zephyr-7b-alpha`         | ✅                       |                            | text-generation-inference, vllm            | 32768                                           |
| `zephyr-7b-beta`          | ✅                       |                            | text-generation-inference, vllm            | 32768                                           |
## Usage
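
For orientation, a minimal sketch of querying one of the newly added models through the `llmengine` Python client; the prompt and sampling parameters are illustrative, not part of this commit:

```python
from llmengine import Completion

# Query one of the newly added models; the model name matches the table above.
response = Completion.create(
    model="mixtral-8x7b-instruct",
    prompt="Explain what a mixture-of-experts layer does.",
    max_new_tokens=128,
    temperature=0.2,
)
print(response.output.text)
```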
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -165,6 +165,8 @@
     "codellama-34b-instruct",
     "mistral-7b",
     "mistral-7b-instruct",
+    "mixtral-8x7b",
+    "mixtral-8x7b-instruct",
     "mammoth-coder-llama-2-7b",
     "mammoth-coder-llama-2-13b",
     "mammoth-coder-llama-2-34b",
@@ -210,6 +212,7 @@
     # Can also see 13B, 34B there too
     "llama-2": {"max_model_len": None, "max_num_batched_tokens": 4096},
     "mistral": {"max_model_len": 8000, "max_num_batched_tokens": 8000},
+    "mixtral": {"max_model_len": 32768, "max_num_batched_tokens": 32768},
     "zephyr": {"max_model_len": 32768, "max_num_batched_tokens": 32768},
 }
```
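The second hunk extends the per-model-family vLLM sizing table with a `mixtral` entry. For reference, a minimal sketch of how a prefix-keyed table like this can be resolved from a concrete model name; the lookup helper below is an illustrative assumption, not code from this commit:

```python
# Hypothetical sketch: resolve vLLM sizing config by model-family prefix.
# The table mirrors the one in llm_model_endpoint_use_cases.py; the lookup
# helper itself is illustrative.
VLLM_MODEL_CONFIG = {
    "llama-2": {"max_model_len": None, "max_num_batched_tokens": 4096},
    "mistral": {"max_model_len": 8000, "max_num_batched_tokens": 8000},
    "mixtral": {"max_model_len": 32768, "max_num_batched_tokens": 32768},
    "zephyr": {"max_model_len": 32768, "max_num_batched_tokens": 32768},
}

def config_for_model(model_name: str) -> dict:
    # Longest-prefix match so "mixtral-8x7b-instruct" resolves to "mixtral"
    # rather than to a shorter, unrelated key.
    matches = [k for k in VLLM_MODEL_CONFIG if model_name.startswith(k)]
    if not matches:
        return {"max_model_len": None, "max_num_batched_tokens": None}
    return VLLM_MODEL_CONFIG[max(matches, key=len)]

print(config_for_model("mixtral-8x7b-instruct"))
# {'max_model_len': 32768, 'max_num_batched_tokens': 32768}
```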

model-engine/model_engine_server/inference/vllm/Dockerfile

Lines changed: 6 additions & 1 deletion (the base image is bumped, and megablocks is pulled in for Mixtral's mixture-of-experts layers)

```diff
@@ -1,8 +1,13 @@
-FROM nvcr.io/nvidia/pytorch:22.12-py3
+FROM nvcr.io/nvidia/pytorch:23.09-py3
 
 RUN pip uninstall torch -y
 COPY requirements.txt /workspace/requirements.txt
 RUN pip install -r requirements.txt
+
+# install special version of megablocks
+RUN pip install git+https://github.com/stanford-futuredata/megablocks.git@5897cd6f254b7b3edf7a708a3a3314ecb54b6f78#egg=megablocks
+
 RUN wget https://github.com/peak/s5cmd/releases/download/v2.2.1/s5cmd_2.2.1_Linux-64bit.tar.gz
 RUN tar -xvzf s5cmd_2.2.1_Linux-64bit.tar.gz
+
 COPY vllm_server.py /workspace/vllm_server.py
```
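
A quick, illustrative sanity check (not part of this commit) that the commit-pinned megablocks install is actually present inside the built image:

```python
# Illustrative sanity check: confirm megablocks is importable in the image
# and report the version pip recorded for the commit-pinned install.
from importlib.metadata import PackageNotFoundError, version

try:
    print("megablocks", version("megablocks"))
except PackageNotFoundError:
    raise SystemExit("megablocks is not installed in this environment")
```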
Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,3 @@
11
ray==2.6.3
2-
vllm==0.2.0
3-
pydantic==1.10.12
2+
vllm==0.2.5
3+
pydantic==1.10.13

model-engine/model_engine_server/infra/repositories/live_tokenizer_repository.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -58,6 +58,8 @@ def get_default_supported_models_info() -> Dict[str, ModelInfo]:
     ),
     "mistral-7b": ModelInfo("mistralai/Mistral-7B-v0.1", None),
     "mistral-7b-instruct": ModelInfo("mistralai/Mistral-7B-Instruct-v0.1", None),
+    "mixtral-8x7b": ModelInfo("mistralai/Mixtral-8x7B-v0.1", None),
+    "mixtral-8x7b-instruct": ModelInfo("mistralai/Mixtral-8x7B-Instruct-v0.1", None),
     "mammoth-coder-llama-2-7b": ModelInfo("TIGER-Lab/MAmmoTH-Coder-7B", None),
     "mammoth-coder-llama-2-13b": ModelInfo("TIGER-Lab/MAmmoTH-Coder-13B", None),
     "mammoth-coder-llama-2-34b": ModelInfo("TIGER-Lab/MAmmoTH-Coder-34B", None),
```
