Commit 514afc1

Merge pull request WasmEdge#188 from alabulei1/alabulei1-patch-1
Signed-off-by: Michael Yuan <michael@secondstate.io>
2 parents e3811ce + 6036292 commit 514afc1

2 files changed: +185 −120 lines changed

docs/develop/rust/wasinn/llm_inference.md

Lines changed: 100 additions & 70 deletions

Besides the [regular WasmEdge and Rust requirements](../../rust/setup.md), please make sure that you have the [Wasi-NN plugin with ggml installed](../../../start/install.md#wasi-nn-plug-in-with-ggml-backend).
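As a quick reference, the WasmEdge installer script can set up the runtime together with the ggml plug-in. The command below is a sketch of one common invocation at the time of this commit; follow the install guide linked above if it differs for your platform:

```bash
# Install WasmEdge together with the WASI-NN ggml plug-in (Linux/macOS).
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | \
  bash -s -- --plugins wasi_nn-ggml
```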
## Quick start

Because the example already includes a compiled WASM file built from the Rust code, we can use the WasmEdge CLI to run it directly. First, download the pre-built `llama-chat.wasm` application from the `llama-utils` repo.

```bash
curl -LO https://github.com/second-state/llama-utils/raw/main/chat/llama-chat.wasm
```

Next, let's get the model. In this example, we use the llama2 7b chat model in GGUF format. You can also use other llama2 models; check out the list [here](https://github.com/second-state/llama-utils/blob/main/chat/README.md#get-model).

```bash
curl -LO https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf
```

Run the inference application in WasmEdge.

```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm
```

After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[USER]:` prompt:

```bash
[USER]:
I have two apples, each costing 5 dollars. What is the total cost of these apples?
[ASSISTANT]:
The total cost of the two apples is 10 dollars.
[USER]:
How about four apples?
[ASSISTANT]:
The total cost of four apples is 20 dollars.
```

## Build and run
Second, use `cargo` to build the example project.

```bash
cargo build --target wasm32-wasi --release
```
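If the build fails because the `wasm32-wasi` target is missing from your Rust toolchain, add it first. This is a standard `rustup` step rather than anything specific to this project:

```bash
# Add the wasm32-wasi compilation target to the active Rust toolchain.
rustup target add wasm32-wasi
```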

The output WASM file is `target/wasm32-wasi/release/llama-chat.wasm`. Next, use WasmEdge to load the llama-2-7b model and then ask the model questions.

```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm
```

After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[USER]:` prompt:

```bash
[USER]:
Who is Robert Oppenheimer?
[ASSISTANT]:
Robert Oppenheimer was an American theoretical physicist and director of the Manhattan Project, which developed the atomic bomb during World War II. He is widely regarded as one of the most important physicists of the 20th century and is known for his contributions to the development of quantum mechanics and the theory of the atomic nucleus. Oppenheimer was also a prominent figure in the post-war nuclear weapons debate and was a strong advocate for international cooperation on nuclear weapons control.
```

## Options

You can configure the chat inference application through CLI options.

```bash
  -m, --model-alias <ALIAS>
          Model alias [default: default]
  -c, --ctx-size <CTX_SIZE>
          Size of the prompt context [default: 4096]
  -n, --n-predict <N_PRDICT>
          Number of tokens to predict [default: 1024]
  -g, --n-gpu-layers <N_GPU_LAYERS>
          Number of layers to run on the GPU [default: 100]
  -b, --batch-size <BATCH_SIZE>
          Batch size for prompt processing [default: 4096]
  -r, --reverse-prompt <REVERSE_PROMPT>
          Halt generation at PROMPT, return control.
  -s, --system-prompt <SYSTEM_PROMPT>
          System prompt message string [default: "[Default system message for the prompt template]"]
  -p, --prompt-template <TEMPLATE>
          Prompt template. [default: llama-2-chat] [possible values: llama-2-chat, codellama-instruct, mistral-instruct-v0.1, mistrallite, openchat, belle-llama-2-chat, vicuna-chat, chatml]
      --log-prompts
          Print prompt strings to stdout
      --log-stat
          Print statistics to stdout
      --log-all
          Print all log information to stdout
      --stream-stdout
          Print the output to stdout in the streaming way
  -h, --help
          Print help
```
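For example, the following invocation combines several of the flags listed above to enlarge the prompt context, cap each response at 512 tokens, and stream the answer to stdout as it is generated. The values shown are illustrative, not recommendations:

```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  llama-chat.wasm --ctx-size 4096 --n-predict 512 --stream-stdout
```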
The `--prompt-template` option is perhaps the most interesting. It allows the application to support different open-source LLM models beyond llama2.

| Template name | Model | Download |
| ------------ | ------------------------------ | --- |
| llama-2-chat | [The standard llama2 chat model](https://ai.meta.com/llama/) | [7b](https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf) |
| codellama-instruct | [CodeLlama](https://about.fb.com/news/2023/08/code-llama-ai-for-coding/) | [7b](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGUF/resolve/main/codellama-7b-instruct.Q5_K_M.gguf) |
| mistral-instruct-v0.1 | [Mistral](https://mistral.ai/) | [7b](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf) |
| mistrallite | [Mistral Lite](https://huggingface.co/amazon/MistralLite) | [7b](https://huggingface.co/TheBloke/MistralLite-7B-GGUF/resolve/main/mistrallite.Q5_K_M.gguf) |
| openchat | [OpenChat](https://github.com/imoneoi/openchat) | [7b](https://huggingface.co/TheBloke/openchat_3.5-GGUF/resolve/main/openchat_3.5.Q5_K_M.gguf) |
| belle-llama-2-chat | [BELLE](https://github.com/LianjiaTech/BELLE) | [13b](https://huggingface.co/second-state/BELLE-Llama2-13B-Chat-0.4M-GGUF/resolve/main/BELLE-Llama2-13B-Chat-0.4M-ggml-model-q4_0.gguf) |
| vicuna-chat | [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) | [7b](https://huggingface.co/TheBloke/vicuna-7B-v1.5-GGUF/resolve/main/vicuna-7b-v1.5.Q5_K_M.gguf) |
| chatml | [ChatML](https://huggingface.co/chargoddard/rpguild-chatml-13b) | [13b](https://huggingface.co/TheBloke/rpguild-chatml-13B-GGUF/resolve/main/rpguild-chatml-13b.Q5_K_M.gguf) |
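For example, to chat with the Mistral model from the table above instead of llama2, download its GGUF file via the link in the table and pass the matching template name. This is a sketch based on the commands shown earlier; adjust the file name if you pick a different model:

```bash
# Download the Mistral 7b instruct model listed in the table above.
curl -LO https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf

# Run the same chat application with the matching prompt template.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:mistral-7b-instruct-v0.1.Q5_K_M.gguf \
  llama-chat.wasm --prompt-template mistral-instruct-v0.1
```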

Furthermore, the following command tells WasmEdge to print out logs and statistics of the model at runtime.

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  llama-chat.wasm --prompt-template llama-2-chat --log-stat
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 76.63 MB
[2023-11-07 02:07:44.019] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |

llama_print_timings: load time = 11523.19 ms
llama_print_timings: sample time = 2.62 ms / 102 runs ( 0.03 ms per token, 38961.04 tokens per second)
llama_print_timings: prompt eval time = 11479.27 ms / 49 tokens ( 234.27 ms per token, 4.27 tokens per second)
llama_print_timings: eval time = 13571.37 ms / 101 runs ( 134.37 ms per token, 7.44 tokens per second)
llama_print_timings: total time = 25104.57 ms
[ASSISTANT]:
Ah, a fellow Peanuts enthusiast! Snoopy is Charlie Brown's lovable and imaginative beagle, known for his wild and wacky adventures in the comic strip and television specials. He's a loyal companion to Charlie Brown and the rest of the Peanuts gang, and his antics often provide comic relief in the series. Is there anything else you'd like to know about Snoopy? 🐶
```

## Improving performance

You can make the inference program run faster by AOT compiling the wasm file first.

```bash
wasmedge compile llama-chat.wasm llama-chat.wasm
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm
```
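If you want to keep the original (non-AOT) wasm file around, you can write the compiled output to a different file and run that instead. This is just a variation of the commands above with an illustrative output file name:

```bash
# AOT-compile into a separate file, keeping the original wasm intact.
wasmedge compile llama-chat.wasm llama-chat-aot.wasm
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat-aot.wasm
```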
## Understand the code

The [main.rs](https://github.com/second-state/llama-utils/blob/main/chat/src/main.rs) file contains the full Rust code for creating an interactive chatbot using an LLM. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2 (and other models') chat templates, and runs the inference operations using the WASI-NN standard API. The code logic for the chat interaction is somewhat complex, so in this section we will use the [simple example](https://github.com/second-state/llama-utils/tree/main/simple) to explain how to set up and perform one inference round trip. Here is how you use the simple example.

```bash
# Download the compiled simple inference wasm
curl -LO https://github.com/second-state/llama-utils/raw/main/simple/llama-simple.wasm

# Give it a prompt and ask it to use the model to complete it.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-simple.wasm \
  --prompt 'Robert Oppenheimer most important achievement is ' --ctx-size 4096

output: in 1942, when he led the team that developed the first atomic bomb, which was dropped on Hiroshima, Japan in 1945.
```

First, let's parse the command line arguments to customize the chatbot's behavior using the `Command` struct. It extracts the following parameters: `prompt` (a prompt that guides the conversation), `model_alias` (an alias for the loaded model), and `ctx_size` (the size of the chat context).

```rust
fn main() -> Result<(), String> {
    let matches = Command::new("Simple LLM inference")
        .arg(
            Arg::new("prompt")
                .short('p')
                // ... (remaining argument definitions elided)
```

Next, the prompt is converted into bytes and set as the input tensor for the model.

```rust
    // ...
    .expect("Failed to set prompt as the input tensor");
```

Next, execute the model inference.

```rust
// execute the inference
context.compute().expect("Failed to complete inference");
```

After the inference is finished, extract the result from the computation context and convert it to a string with `String::from_utf8_lossy`, which replaces any invalid UTF-8 sequences in the output.

```rust
let mut output_buffer = vec![0u8; *CTX_SIZE.get().unwrap()];
// ... (read the output tensor into output_buffer and convert it to the `output` string)
println!("\nprompt: {}", &prompt);
println!("\noutput: {}", output);
```
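Putting the pieces above together, one inference round trip looks roughly like the sketch below. It is a condensed outline based on the snippets in this section and the `wasi-nn` crate calls they use (`GraphBuilder::build_from_cache`, `init_execution_context`, `set_input`, `compute`, `get_output`); the actual `main.rs` wraps the same calls in argument parsing and error handling, so treat this as an approximation rather than the exact source.

```rust
use wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn infer(prompt: &str) -> String {
    // Load the model that was preloaded under the alias "default"
    // via the --nn-preload flag on the wasmedge command line.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("Failed to load the model");
    let mut context = graph
        .init_execution_context()
        .expect("Failed to init the execution context");

    // Pass the prompt bytes as the input tensor and run the inference.
    let tensor_data = prompt.as_bytes().to_vec();
    context
        .set_input(0, TensorType::U8, &[1], &tensor_data)
        .expect("Failed to set prompt as the input tensor");
    context.compute().expect("Failed to complete inference");

    // Read the generated text back from the output tensor.
    let mut output_buffer = vec![0u8; 4096];
    let output_size = context
        .get_output(0, &mut output_buffer)
        .expect("Failed to get the output tensor");
    String::from_utf8_lossy(&output_buffer[..output_size]).to_string()
}
```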

## Resources

* If you're looking for multi-turn conversations with llama 2 models, please check out the above-mentioned chat example source code [here](https://github.com/second-state/llama-utils/tree/main/chat).
* If you want to construct OpenAI-compatible APIs for your llama2 (or other) models, please check out the source code [for the API server](https://github.com/second-state/llama-utils/tree/main/api-server).
* To learn more, please check out [this article](https://medium.com/stackademic/fast-and-portable-llama2-inference-on-the-heterogeneous-edge-a62508e82359).