
Commit c39905e

Update llm_inference.md
Signed-off-by: Michael Yuan <michael@secondstate.io>
1 parent 8b5d011

1 file changed (+44, -41 lines)
docs/develop/rust/wasinn/llm_inference.md

Besides the [regular WasmEdge and Rust requirements](../../rust/setup.md), please make sure that you have the [Wasi-NN plugin with ggml installed](../../../start/install.md#wasi-nn-plug-in-with-ggml-backend).

## Quick start

Because the example already includes a compiled WASM file built from the Rust code, we can use the WasmEdge CLI to execute the example directly. First, download the pre-built `llama-chat.wasm` application from the `llama-utils` repo.

```bash
curl -LO https://github.com/second-state/llama-utils/raw/main/chat/llama-chat.wasm
```

Next, let's get the model. In this example, we are going to use the llama2 7b chat model in GGUF format. You can also use other llama2 models; check out the list [here](https://github.com/second-state/llama-utils/blob/main/chat/README.md#get-model).

```bash
curl -LO https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf
```

Run the inference application in WasmEdge.

```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm
```

After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[USER]:` prompt:

```bash
[USER]:
I have two apples, each costing 5 dollars. What is the total cost of these apple
[ASSISTANT]:
The total cost of the two apples is 10 dollars.
[USER]:
How about four apples?
[ASSISTANT]:
The total cost of four apples is 20 dollars.
```

Next, use WasmEdge to load the llama-2-13b model and then ask the model questions.

```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-chat-q5_k_m.gguf llama-chat.wasm
```

After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[USER]:` prompt:

```bash
[USER]:
Who is Robert Oppenheimer?
[ASSISTANT]:
Robert Oppenheimer was an American theoretical physicist and director of the Manhattan Project, which developed the atomic bomb during World War II. He is widely regarded as one of the most important physicists of the 20th century and is known for his contributions to the development of quantum mechanics and the theory of the atomic nucleus. Oppenheimer was also a prominent figure in the post-war nuclear weapons debate and was a strong advocate for international cooperation on nuclear weapons control.
```

## Options

You can configure the chat inference application through CLI options.

```bash
  -m, --model-alias <ALIAS>
      --log-stat
          Print statistics to stdout
      --log-all
          Print all log information to stdout
      --stream-stdout
          Print the output to stdout in the streaming way
  -h, --help
          Print help
```

The `--prompt-template` option is perhaps the most interesting. It allows the application to support different open source LLM models beyond llama2.

| Template name | Model | Download |
| ------------ | ------------------------------ | --- |
| llama-2-chat | [The standard llama2 chat model](https://ai.meta.com/llama/) | [7b](https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf) |
| codellama-instruct | [CodeLlama](https://about.fb.com/news/2023/08/code-llama-ai-for-coding/) | [7b](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGUF/resolve/main/codellama-7b-instruct.Q5_K_M.gguf) |
| mistral-instruct-v0.1 | [Mistral](https://mistral.ai/) | [7b](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf) |
| mistrallite | [Mistral Lite](https://huggingface.co/amazon/MistralLite) | [7b](https://huggingface.co/TheBloke/MistralLite-7B-GGUF/resolve/main/mistrallite.Q5_K_M.gguf) |
| openchat | [OpenChat](https://github.com/imoneoi/openchat) | [7b](https://huggingface.co/TheBloke/openchat_3.5-GGUF/resolve/main/openchat_3.5.Q5_K_M.gguf) |
| belle-llama-2-chat | [BELLE](https://github.com/LianjiaTech/BELLE) | [13b](https://huggingface.co/second-state/BELLE-Llama2-13B-Chat-0.4M-GGUF/resolve/main/BELLE-Llama2-13B-Chat-0.4M-ggml-model-q4_0.gguf) |
| vicuna-chat | [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) | [7b](https://huggingface.co/TheBloke/vicuna-7B-v1.5-GGUF/resolve/main/vicuna-7b-v1.5.Q5_K_M.gguf) |
| chatml | [ChatML](https://huggingface.co/chargoddard/rpguild-chatml-13b) | [13b](https://huggingface.co/TheBloke/rpguild-chatml-13B-GGUF/resolve/main/rpguild-chatml-13b.Q5_K_M.gguf) |
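
To make the table concrete, the sketch below shows roughly what a template such as `llama-2-chat` does: it folds the system prompt and the conversation history into the `<<SYS>>` and `[INST]` markers that llama2 chat models expect. The helper name and signature here are hypothetical and not part of `llama-chat.wasm`; see the chat example's source for the real template handling.

```rust
// Illustrative sketch only: a hypothetical helper that renders a
// llama-2-chat style prompt. The real template logic lives in the
// chat example's main.rs.
fn build_llama2_chat_prompt(
    system: &str,
    history: &[(String, String)], // completed (user, assistant) turns
    user_msg: &str,               // the newest user message
) -> String {
    // The system prompt and the first [INST] block open the prompt.
    let mut prompt = format!("<s>[INST] <<SYS>>\n{} <</SYS>>\n\n", system);
    // Every completed turn is replayed, so the model sees the full
    // conversation history on each request.
    for (user, assistant) in history {
        prompt.push_str(&format!("{} [/INST] {} </s><s>[INST] ", user, assistant));
    }
    // The newest user message ends with an open [/INST] tag for the
    // model to complete.
    prompt.push_str(&format!("{} [/INST]", user_msg));
    prompt
}
```

Other templates in the table differ mainly in these markers (ChatML, for example, uses `<|im_start|>` and `<|im_end|>` tags), which is why the same wasm application can drive all of these models.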

Furthermore, the following command tells WasmEdge to print out logs and statistics of the model at runtime.

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  llama-chat.wasm --prompt-template llama-2-chat --log-stat
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
Ah, a fellow Peanuts enthusiast! Snoopy is Charlie Brown's lovable and imaginative beagle, known for his wild and wacky adventures in the comic strip and television specials. He's a loyal companion to Charlie Brown and the rest of the Peanuts gang, and his antics often provide comic relief in the series. Is there anything else you'd like to know about Snoopy? 🐶
```

## Improving performance

You can make the inference program run faster by AOT compiling the wasm file first.

```bash
wasmedge compile llama-chat.wasm llama-chat.wasm
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-q5_k_m.gguf llama-chat.wasm
```

## Understand the code

The [main.rs](https://github.com/second-state/llama-utils/blob/main/chat/src/main.rs) is the full Rust code to create an interactive chatbot using an LLM. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2 and other models' chat templates, and runs the inference operations using the WASI NN standard API. The code logic for the chat interaction is somewhat complex. In this section, we will use the [simple example](https://github.com/second-state/llama-utils/tree/main/simple) to explain how to set up and perform one inference round trip. Here is how you use the simple example.

```bash
# Download the compiled simple inference wasm
curl -LO https://github.com/second-state/llama-utils/raw/main/simple/llama-simple.wasm

# Give it a prompt and ask it to use the model to complete it.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-simple.wasm \
  --prompt 'Robert Oppenheimer most important achievement is ' --ctx-size 4096

output: in 1942, when he led the team that developed the first atomic bomb, which was dropped on Hiroshima, Japan in 1945.
```
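
Under the hood, the simple example performs one round trip through the WASI-NN API: load the model that was preloaded under an alias, create an execution context, pass the prompt in as a byte tensor, compute, and read the generated bytes back. The condensed sketch below illustrates that flow; the exact function names and error handling depend on the version of the wasi-nn Rust bindings in use, so treat main.rs as the authoritative code.

```rust
use wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // "default" matches the alias in --nn-preload default:GGML:AUTO:<model>.gguf
    let model_alias = "default";
    let prompt = "Robert Oppenheimer most important achievement is ";

    // Load the GGUF model that WasmEdge preloaded under the alias.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache(model_alias)
        .expect("failed to load the preloaded model");
    let mut context = graph.init_execution_context().expect("no execution context");

    // The prompt is passed to the backend as a UTF-8 byte tensor at input index 0.
    let tensor_data = prompt.as_bytes().to_vec();
    context
        .set_input(0, TensorType::U8, &[1], &tensor_data)
        .expect("failed to set input");

    // Run the inference.
    context.compute().expect("inference failed");

    // Read the generated text back from output index 0.
    let mut output_buffer = vec![0u8; 4096];
    let size = context.get_output(0, &mut output_buffer).expect("failed to get output");
    let output = String::from_utf8_lossy(&output_buffer[..size]);
    println!("output: {}", output);
}
```

In the chat example, the same calls run in a loop, with the conversation history folded into the prompt on every turn.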

First, let's parse the command line arguments to customize the chatbot's behavior using the `Command` struct. It extracts the following parameters: `prompt` (a prompt that guides the conversation), `model_alias` (an alias for the loaded model), and `ctx_size` (the size of the chat context).

```rust
fn main() -> Result<(), String> {
    let matches = Command::new("Simple LLM inference")
        .arg(
            Arg::new("prompt")
                .short('p')
```
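
The snippet above is truncated here. As a rough sketch (not the exact `main.rs` code), the remaining argument definitions and the value extraction typically look like the following with `clap`; the default values shown are assumptions for illustration.

```rust
use clap::{Arg, Command};

// Illustrative sketch, not the exact main.rs code: declare the remaining
// arguments and read the parsed values back after get_matches().
fn parse_args() -> (String, String, usize) {
    let matches = Command::new("Simple LLM inference")
        .arg(Arg::new("prompt").short('p').long("prompt").required(true))
        .arg(Arg::new("model_alias").short('m').long("model-alias").default_value("default"))
        .arg(Arg::new("ctx_size").long("ctx-size").default_value("4096"))
        .get_matches();

    let prompt = matches.get_one::<String>("prompt").unwrap().clone();
    let model_alias = matches.get_one::<String>("model_alias").unwrap().clone();
    let ctx_size = matches
        .get_one::<String>("ctx_size")
        .unwrap()
        .parse::<usize>()
        .expect("--ctx-size must be a number");
    (prompt, model_alias, ctx_size)
}
```

After parsing, these values feed the model setup and the prompt for the inference call.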

At the end of the program, the prompt and the generated output are printed to the console.

```rust
println!("\nprompt: {}", &prompt);
println!("\noutput: {}", output);
```

## Resources

* If you're looking for multi-turn conversations with llama 2 models, please check out the above-mentioned chat example's source code [here](https://github.com/second-state/llama-utils/tree/main/chat).
* If you want to construct OpenAI-compatible APIs specifically for your llama2 model, or the Llama2 model itself, please check out the source code [for the API server](https://github.com/second-state/llama-utils/tree/main/api-server).
* To learn more, please check out [this article](https://medium.com/stackademic/fast-and-portable-llama2-inference-on-the-heterogeneous-edge-a62508e82359).
