docs/develop/rust/wasinn/llm_inference.md
Lines changed: 31 additions & 57 deletions
@@ -2,44 +2,11 @@
 sidebar_position: 1
 ---
 
-# Llama 2 inference
-
-WasmEdge now supports running open source models in Rust. We will use [this example project](https://github.com/second-state/LlamaEdge/tree/main/chat) to show how to make AI inferences with the llama2 model in WasmEdge and Rust.
-
-WasmEdge now supports the following models:
-
-1. Llama-2-7B-Chat
-1. Llama-2-13B-Chat
-1. CodeLlama-13B-Instruct
-1. Mistral-7B-Instruct-v0.1
-1. Mistral-7B-Instruct-v0.2
-1. MistralLite-7B
-1. OpenChat-3.5-0106
-1. OpenChat-3.5-1210
-1. OpenChat-3.5
-1. Wizard-Vicuna-13B-Uncensored-GGUF
-1. TinyLlama-1.1B-Chat-v1.0
-1. Baichuan2-13B-Chat
-1. OpenHermes-2.5-Mistral-7B
-1. Dolphin-2.2-Yi-34B
-1. Dolphin-2.6-Mistral-7B
-1. Samantha-1.2-Mistral-7B
-1. Samantha-1.11-CodeLlama-34B
-1. WizardCoder-Python-7B-V1.0
-1. Zephyr-7B-Alpha
-1. WizardLM-13B-V1.0-Uncensored
-1. Orca-2-13B
-1. Neural-Chat-7B-v3-1
-1. Yi-34B-Chat
-1. Starling-LM-7B-alpha
-1. DeepSeek-Coder-6.7B
-1. DeepSeek-LLM-7B-Chat
-1. SOLAR-10.7B-Instruct-v1.0
-1. Mixtral-8x7B-Instruct-v0.1
-1. Nous-Hermes-2-Mixtral-8x7B-DPO
-1. Nous-Hermes-2-Mixtral-8x7B-SFT
-
-And more, please check [the supported models](https://github.com/second-state/LlamaEdge/blob/main/models.md) for details.
+# LLM inference
+
+WasmEdge now supports running open-source Large Language Models (LLMs) in Rust. We will use [this example project](https://github.com/second-state/LlamaEdge/tree/main/chat) to show how to make AI inferences with the llama-3.1-8B model in WasmEdge and Rust.
+
+Basically, WasmEdge can support any open-source LLM. Please check [the supported models](https://github.com/second-state/LlamaEdge/blob/main/models.md) for details.
 
 ## Prerequisite
 
@@ -55,23 +22,23 @@ First, get the latest llama-chat wasm application
-Next, let's get the model. In this example, we are going to use the llama2 7b chat model in GGUF format. You can also use other kinds of llama2 models, check out [here](https://github.com/second-state/llamaedge/blob/main/chat/README.md#get-model).
+Next, let's get the model. In this example, we are going to use the llama-3.1-8B model in GGUF format. You can also use other kinds of LLMs; check out [here](https://github.com/second-state/llamaedge/blob/main/chat/README.md#get-model).
 
 After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[USER]:` prompt:
 
 ```bash
 [USER]:
-I have two apples, each costing 5 dollars. What is the total cost of these apple
+I have two apples, each costing 5 dollars. What is the total cost of these apples?
 [ASSISTANT]:
 The total cost of the two apples is 10 dollars.
 [USER]:
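> Editor's note: the GGUF file downloaded in this step is loaded by the WasmEdge host and referred to from Rust code by an alias (`default` unless overridden with `-m`). A minimal sketch of that loading step, assuming the `wasmedge-wasi-nn` crate the example project uses; the alias string and error messages here are illustrative, not the example's exact code:

```rust
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding};

// Load the GGUF model the host preloaded under the alias "default"
// (registered on the host side, e.g. via an --nn-preload style flag).
let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
    .build_from_cache("default")
    .expect("failed to load the preloaded model");

// Each chat session works through its own execution context.
let mut context = graph
    .init_execution_context()
    .expect("failed to create an execution context");
```

Loading by alias means the multi-gigabyte model stays on the host side; the wasm app only holds a handle to it.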
@@ -95,19 +62,26 @@ Second, use `cargo` to build the example project.
 cargo build --target wasm32-wasi --release
 ```
 
-The output WASM file is `target/wasm32-wasi/release/llama-chat.wasm`. Next, use WasmEdge to load the llama-2-7b model and then ask the model to questions.
+The output WASM file is `target/wasm32-wasi/release/llama-chat.wasm`. Next, use WasmEdge to load the llama-3.1-8b model and then ask the model questions.
 
-After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[USER]:` prompt:
+After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[You]:` prompt:
 
 ```bash
-[USER]:
-Who is Robert Oppenheimer?
-[ASSISTANT]:
-Robert Oppenheimer was an American theoretical physicist and director of the Manhattan Project, which developed the atomic bomb during World War II. He is widely regarded as one of the most important physicists of the 20th century and is known for his contributions to the development of quantum mechanics and the theory of the atomic nucleus. Oppenheimer was also a prominent figure in the post-war nuclear weapons debate and was a strong advocate for international cooperation on nuclear weapons control.
+[You]:
+Which one is greater? 9.11 or 9.8?
+
+[Bot]:
+9.11 is greater.
+
+[You]:
+why
+
+[Bot]:
+11 is greater than 8.
 ```
 
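> Editor's note: a multi-turn transcript like the one above comes from a plain read-eval-print loop that appends each turn to the running prompt, so every inference sees the whole conversation history. A rough, self-contained sketch of that shape — the `infer` stub and the `<|user|>`/`<|assistant|>` tags are placeholders, not the example's real inference code or prompt template:

```rust
use std::io::{self, BufRead, Write};

// Placeholder for the real WASI-NN call (set_input / compute / get_output).
fn infer(prompt: &str) -> String {
    format!("(model reply to: {})", prompt.lines().last().unwrap_or(""))
}

fn main() {
    let stdin = io::stdin();
    let mut conversation = String::new();
    loop {
        println!("[You]:");
        io::stdout().flush().unwrap();
        let mut question = String::new();
        if stdin.lock().read_line(&mut question).unwrap() == 0 {
            break; // EOF ends the session
        }
        // Append the new user turn so the model sees the full history.
        conversation.push_str(&format!("<|user|>\n{}\n", question.trim()));
        let answer = infer(&conversation);
        println!("[Bot]:\n{}", answer);
        conversation.push_str(&format!("<|assistant|>\n{}\n", answer));
    }
}
```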
 ## Options
 
@@ -118,13 +92,13 @@ You can configure the chat inference application through CLI options.
   -m, --model-alias <ALIAS>
           Model alias [default: default]
   -c, --ctx-size <CTX_SIZE>
-          Size of the prompt context [default: 4096]
+          Size of the prompt context [default: 512]
   -n, --n-predict <N_PRDICT>
           Number of tokens to predict [default: 1024]
   -g, --n-gpu-layers <N_GPU_LAYERS>
           Number of layers to run on the GPU [default: 100]
   -b, --batch-size <BATCH_SIZE>
-          Batch size for prompt processing [default: 4096]
+          Batch size for prompt processing [default: 512]
   -r, --reverse-prompt <REVERSE_PROMPT>
           Halt generation at PROMPT, return control.
   -s, --system-prompt <SYSTEM_PROMPT>
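> Editor's note: these CLI options are not interpreted by WasmEdge itself; the chat app forwards them to the GGML backend as a JSON metadata string. A hedged sketch of that mapping, assuming `serde_json` and metadata key names accepted by the WasmEdge GGML plugin — treat the exact keys as assumptions, not a verified API:

```rust
use serde_json::json;

// Hypothetical translation of the CLI options above into backend metadata.
let metadata = json!({
    "ctx-size": 512,             // -c, --ctx-size
    "n-predict": 1024,           // -n, --n-predict
    "n-gpu-layers": 100,         // -g, --n-gpu-layers
    "batch-size": 512,           // -b, --batch-size
    "reverse-prompt": "[USER]:", // -r, --reverse-prompt
    "enable-log": true           // print logs and statistics at runtime
});
let config = metadata.to_string(); // handed to the graph when it is built
```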
@@ -148,8 +122,8 @@ The `--prompt-template` option is perhaps the most interesting. It allows the ap
 
 Furthermore, the following command tells WasmEdge to print out logs and statistics of the model at runtime.
 
 --prompt 'Robert Oppenheimer most important achievement is ' --ctx-size 512
 
 output: in 1942, when he led the team that developed the first atomic bomb, which was dropped on Hiroshima, Japan in 1945.
@@ -275,7 +249,7 @@ Next, execute the model inference.
 context.compute().expect("Failed to complete inference");
 ```
 
-After the inference is finished, extract the result from the computation context and lose invalid UTF8 sequences handled by converting the output to a string using `String::from_utf8_lossy`.
+After the inference is finished, extract the result from the computation context and convert the output to a string using `String::from_utf8_lossy`, which handles any invalid UTF-8 sequences.
 
 ```rust
 let mut output_buffer = vec![0u8; *CTX_SIZE.get().unwrap()];
 ```
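> Editor's note: the diff cuts this snippet off after the first line. For completeness, a hedged sketch of how it continues, assuming the `wasmedge-wasi-nn` crate's `get_output` signature; variable names are illustrative:

```rust
let mut output_buffer = vec![0u8; *CTX_SIZE.get().unwrap()];
// Copy the answer's raw bytes out of the execution context.
let output_size = context
    .get_output(0, &mut output_buffer)
    .expect("Failed to get output");
// from_utf8_lossy replaces any invalid UTF-8 sequences with U+FFFD
// instead of failing, which is why the conversion is called "lossy".
let output = String::from_utf8_lossy(&output_buffer[..output_size]).to_string();
println!("{}", output);
```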
 
 * If you're looking for multi-turn conversations with llama 2 models, please check out the above-mentioned chat example source code [here](https://github.com/second-state/llamaedge/tree/main/chat).
-* If you want to construct OpenAI-compatible APIs specifically for any open-source LLMs, please check out the source code [for the API server](https://github.com/second-state/llamaedge/tree/main/api-server).
+* If you want to construct OpenAI-compatible APIs specifically for your llama2 model, or the Llama2 model itself, please check out the source code [for the API server](https://github.com/second-state/llamaedge/tree/main/api-server).
 * To learn more, please check out [this article](https://medium.com/stackademic/fast-and-portable-llama2-inference-on-the-heterogeneous-edge-a62508e82359).