
Commit 0a3accf

committed
Signed-off-by: alabulei1 <vivian.xiage@gmail.com>
1 parent 55b4dbe commit 0a3accf

5 files changed: +517 -6 lines changed
Lines changed: 248 additions & 0 deletions
@@ -0,0 +1,248 @@
---
sidebar_position: 1
---

# Llama 2 inference

WasmEdge now supports running the llama2 series of models in Rust. We will use [this example project](https://github.com/second-state/llama-utils/tree/main/chat) to show how to make AI inferences with the llama2 model in WasmEdge and Rust.

WasmEdge now supports Llama2, Codellama-instruct, BELLE-Llama, Mistral-7b-instruct, Wizard-vicuna, OpenChat 3.5B, and raguile-chatml models.

## Prerequisites

Besides the [regular WasmEdge and Rust requirements](../../rust/setup.md), please make sure that you have the [Wasi-NN plugin with ggml installed](../../../start/install.md#wasi-nn-plug-in-with-ggml-backend).
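
If you have not installed the plug-in yet, the following is a minimal sketch of the installer invocation, assuming the standard WasmEdge install script and the `wasi_nn-ggml` plug-in name used in the install guide linked above; see that guide for the authoritative steps.

```bash
# Sketch: install WasmEdge together with the WASI-NN ggml plug-in (assumed plug-in name).
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | \
  bash -s -- --plugins wasi_nn-ggml
```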

## Quick start

Because the example already includes a compiled WASM file built from the Rust code, we can use the WasmEdge CLI to execute the example directly. First, git clone the `llama-utils` repo.

```bash
git clone https://github.com/second-state/llama-utils.git
cd llama-utils/chat
```

Next, let's get the model. In this example, we are going to use the llama2 7b chat model in GGUF format. You can also use other kinds of llama2 models; check out [here](https://github.com/second-state/llama-utils/blob/main/chat/README.md#get-model).

```bash
curl -LO https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf
```

Run the inference application in WasmEdge.

```bash
wasmedge --dir .:. \
  --nn-preload default:GGML:CPU:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm default \
  --prompt 'Robert Oppenheimer most important achievement is ' \
  --ctx-size 4096
```

After executing the command, it may take a moment for the output to start appearing. Once the execution is complete, output similar to the following will be generated.

```bash
Robert Oppenheimer most important achievement is
1945 Manhattan Project.
Robert Oppenheimer was born in New York City on April 22, 1904. He was the son of Julius Oppenheimer, a wealthy German-Jewish textile merchant, and Ella Friedman Oppenheimer.
Robert Oppenheimer was a brilliant student. He attended the Ethical Culture School in New York City and graduated from the Ethical Culture Fieldston School in 1921. He then attended Harvard University, where he received his bachelor's degree.
```

## Build and run

Let's build the WASM file from the Rust source code. First, git clone the `llama-utils` repo.

```bash
git clone https://github.com/second-state/llama-utils.git
cd llama-utils/chat
```

Second, use `cargo` to build the example project.

```bash
cargo build --target wasm32-wasi --release
```

The output WASM file is `target/wasm32-wasi/release/llama-chat.wasm`.

We also need to get the model. Here we use the llama-2-13b model.

```bash
curl -LO https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-13b-q5_k_m.gguf
```

Next, use WasmEdge to load the llama-2-13b model and then ask the model questions.

```bash
wasmedge --dir .:. \
  --nn-preload default:GGML:CPU:llama-2-13b-q5_k_m.gguf llama-chat.wasm default \
  --prompt 'Robert Oppenheimer most important achievement is ' \
  --ctx-size 4096
```

After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[USER]:` prompt:

```bash
Robert Oppenheimer most important achievement is
1945 Manhattan Project.
Robert Oppenheimer was born in New York City on April 22, 1904. He was the son of Julius Oppenheimer, a wealthy German-Jewish textile merchant, and Ella Friedman Oppenheimer.
Robert Oppenheimer was a brilliant student. He attended the Ethical Culture School in New York City and graduated from the Ethical Culture Fieldston School in 1921. He then attended Harvard University, where he received his bachelor's degree.
```

## Optional: Configure the model

You can use environment variables to configure the model execution.

| Option | Default | Function |
| ------ | ------- | -------- |
| LLAMA_LOG | 0 | The backend will print diagnostic information when this value is set to 1 |
| LLAMA_N_CTX | 512 | The context length is the max number of tokens in the entire conversation |
| LLAMA_N_PREDICT | 512 | The number of tokens to generate in each response from the model |

For example, the following command specifies a context length of 4k tokens, which is standard for llama2, and the max number of tokens in each response to be 128. It also tells WasmEdge to print out logs and statistics of the model at runtime.

```bash
LLAMA_LOG=1 LLAMA_N_CTX=4096 LLAMA_N_PREDICT=128 wasmedge --dir .:. \
  --nn-preload default:GGML:CPU:llama-2-7b.Q5_K_M.gguf llama-simple.wasm default \
  --prompt 'Robert Oppenheimer most important achievement is ' \
  --ctx-size 4096

...................................................................................................
[2023-10-08 23:13:10.272] [info] [WASI-NN] GGML backend: set n_ctx to 4096
llama_new_context_with_model: kv self size = 2048.00 MB
llama_new_context_with_model: compute buffer total size = 297.47 MB
llama_new_context_with_model: max tensor size = 102.54 MB
[2023-10-08 23:13:10.472] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
[2023-10-08 23:13:10.472] [info] [WASI-NN] GGML backend: set n_predict to 128
[2023-10-08 23:13:16.014] [info] [WASI-NN] GGML backend: llama_get_kv_cache_token_count 128

llama_print_timings: load time = 1431.58 ms
llama_print_timings: sample time = 3.53 ms / 118 runs ( 0.03 ms per token, 33446.71 tokens per second)
llama_print_timings: prompt eval time = 1230.69 ms / 11 tokens ( 111.88 ms per token, 8.94 tokens per second)
llama_print_timings: eval time = 4295.81 ms / 117 runs ( 36.72 ms per token, 27.24 tokens per second)
llama_print_timings: total time = 5742.71 ms
Robert Oppenheimer most important achievement is
1945 Manhattan Project.
Robert Oppenheimer was born in New York City on April 22, 1904. He was the son of Julius Oppenheimer, a wealthy German-Jewish textile merchant, and Ella Friedman Oppenheimer.
Robert Oppenheimer was a brilliant student. He attended the Ethical Culture School in New York City and graduated from the Ethical Culture Fieldston School in 1921. He then attended Harvard University, where he received his bachelor's degree.
```

## Improve performance

You can make the inference program run faster by compiling the WASM file ahead of time (AOT) first.

```bash
wasmedge compile llama-chat.wasm llama-chat.wasm
wasmedge --dir .:. \
  --nn-preload default:GGML:CPU:llama-2-13b-q5_k_m.gguf \
  llama-chat.wasm --model-alias default --prompt-template llama-2-chat
```

## Understand the code

The [main.rs](https://github.com/second-state/llama-utils/blob/main/chat/src/main.rs) is the full Rust code to create an interactive chatbot using an LLM. The Rust program manages the user input, tracks the conversation history, transforms the text into the chat templates used by llama2 and other models, and runs the inference operations using the WASI NN standard API.
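
As an illustration only (this is not the exact code in main.rs, which supports several prompt templates), a llama-2-chat style template wraps a single user turn roughly as follows; the function name and parameters here are assumptions for the sketch.

```rust
// Illustrative sketch of a llama-2-chat style prompt wrapper (not taken from main.rs).
fn build_llama2_chat_prompt(system: &str, user: &str) -> String {
    // The llama-2-chat format places the system prompt inside <<SYS>> tags
    // and the user message inside an [INST] ... [/INST] block.
    format!("<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]")
}
```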

First, let's parse the command line arguments to customize the chatbot's behavior using the `Command` struct. It extracts the following parameters: `prompt` (a prompt that guides the conversation), `model_alias` (the alias of the loaded model), and `ctx_size` (the size of the chat context).

```rust
use clap::{Arg, Command}; // clap types used for argument parsing (imported at the top of main.rs)
// `DEFAULT_CTX_SIZE` and `CTX_SIZE` are globals defined elsewhere in main.rs (not shown here).

fn main() -> Result<(), String> {
    let matches = Command::new("Llama API Server")
        .arg(
            Arg::new("prompt")
                .short('p')
                .long("prompt")
                .value_name("PROMPT")
                .help("Sets the prompt.")
                .required(true),
        )
        .arg(
            Arg::new("model_alias")
                .short('m')
                .long("model-alias")
                .value_name("ALIAS")
                .help("Sets the model alias")
                .default_value("default"),
        )
        .arg(
            Arg::new("ctx_size")
                .short('c')
                .long("ctx-size")
                .value_parser(clap::value_parser!(u32))
                .value_name("CTX_SIZE")
                .help("Sets the prompt context size")
                .default_value(DEFAULT_CTX_SIZE),
        )
        .get_matches();

    // model alias
    let model_name = matches
        .get_one::<String>("model_alias")
        .unwrap()
        .to_string();

    // prompt context size
    let ctx_size = matches.get_one::<u32>("ctx_size").unwrap();
    CTX_SIZE
        .set(*ctx_size as usize)
        .expect("Fail to parse prompt context size");

    // prompt
    let prompt = matches.get_one::<String>("prompt").unwrap().to_string();
```

After that, the program creates a new graph using the `GraphBuilder` and loads the model specified by `model_name`.

```rust
// load the model to wasi-nn
let graph =
    wasi_nn::GraphBuilder::new(wasi_nn::GraphEncoding::Ggml, wasi_nn::ExecutionTarget::AUTO)
        .build_from_cache(&model_name)
        .expect("Failed to load the model");
```

Next, we create an execution context from the loaded graph. The context is mutable because we will be changing it when we set the input tensor and execute the inference.

```rust
// initialize the execution context
let mut context = graph
    .init_execution_context()
    .expect("Failed to init context");
```

Next, the prompt is converted into bytes and set as the input tensor for the model inference.

```rust
// set input tensor
let tensor_data = prompt.as_str().as_bytes().to_vec();
context
    .set_input(0, wasi_nn::TensorType::U8, &[1], &tensor_data)
    .expect("Failed to set prompt as the input tensor");
```

Next, execute the model inference.

```rust
// execute the inference
context.compute().expect("Failed to complete inference");
```

After the inference is finished, extract the result from the computation context and convert it to a string using `String::from_utf8_lossy`, which replaces any invalid UTF-8 sequences.

```rust
let mut output_buffer = vec![0u8; *CTX_SIZE.get().unwrap()];
let mut output_size = context
    .get_output(0, &mut output_buffer)
    .expect("Failed to get output tensor");
output_size = std::cmp::min(*CTX_SIZE.get().unwrap(), output_size);
let output = String::from_utf8_lossy(&output_buffer[..output_size]).to_string();
```

Finally, print the prompt and the inference output to the console.

```rust
println!("\nprompt: {}", &prompt);
println!("\noutput: {}", output);
```
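
Putting these pieces together, here is a minimal single-turn sketch assembled from the snippets above. It uses only the wasi-nn calls shown in this walkthrough; the `infer` helper name and the hard-coded buffer size are assumptions for illustration, not code from main.rs.

```rust
// A minimal single-turn inference sketch (illustrative, not the actual main.rs).
fn infer(model_name: &str, prompt: &str) -> String {
    const OUTPUT_BUFFER_SIZE: usize = 4096; // assumed upper bound on the output size

    // Load the model that was preloaded via `--nn-preload <alias>:GGML:CPU:<file>`.
    let graph =
        wasi_nn::GraphBuilder::new(wasi_nn::GraphEncoding::Ggml, wasi_nn::ExecutionTarget::AUTO)
            .build_from_cache(model_name)
            .expect("Failed to load the model");

    // Create a mutable execution context for this run.
    let mut context = graph
        .init_execution_context()
        .expect("Failed to init context");

    // Set the prompt bytes as the input tensor and run the inference.
    let tensor_data = prompt.as_bytes().to_vec();
    context
        .set_input(0, wasi_nn::TensorType::U8, &[1], &tensor_data)
        .expect("Failed to set prompt as the input tensor");
    context.compute().expect("Failed to complete inference");

    // Copy the output tensor into a buffer and convert it to a string.
    let mut output_buffer = vec![0u8; OUTPUT_BUFFER_SIZE];
    let output_size = context
        .get_output(0, &mut output_buffer)
        .expect("Failed to get output tensor")
        .min(OUTPUT_BUFFER_SIZE);
    String::from_utf8_lossy(&output_buffer[..output_size]).to_string()
}
```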

The code explanation above covers a simple, one-time chat with the llama2 model. But we have more!

* If you're looking for continuous conversations with llama2 models, please check out the source code [here](https://github.com/second-state/llama-utils/tree/main/chat).
* If you want to construct OpenAI-compatible APIs for your own llama2 model, or for the Llama2 model itself, please check out the source code [here](https://github.com/second-state/llama-utils/tree/main/api-server).
* To learn why we run the llama2 model with WasmEdge, please check out [this article](https://medium.com/stackademic/fast-and-portable-llama2-inference-on-the-heterogeneous-edge-a62508e82359).

docs/develop/rust/wasinn/mediapipe.md

Lines changed: 3 additions & 3 deletions
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 1
+sidebar_position: 2
 ---

 # Mediapipe solutions
@@ -19,7 +19,7 @@ git clone https://github.com/juntao/demo-object-detection
 cd demo-object-detection/
 ```

-Build an inference application using the Mediapipe object dection model.
+Build an inference application using the Mediapipe object detection model.

 ```bash
 cargo build --target wasm32-wasi --release
@@ -69,7 +69,7 @@ let detector = ObjectDetectorBuilder::new()
   .build_from_buffer(model_data)?;
 ```

-The `detect()` function takes in an image, pre-processes it into a tensor array, runs inference on the mediapipe object detection model, and the post-processes the returned tensor array into a human redable format stored in the `detection_result`.
+The `detect()` function takes in an image, pre-processes it into a tensor array, runs inference on the mediapipe object detection model, and then post-processes the returned tensor array into a human readable format stored in the `detection_result`.

 ```rust
 let mut input_img = image::open(img_path)?;

docs/start/install.md

Lines changed: 15 additions & 0 deletions
@@ -145,12 +145,27 @@ Then, go to [HTTPS request in Rust chapter](../develop/rust/http_service/client.

 WasmEdge supports various backends for `WASI-NN`.

+- [ggml backend](#wasi-nn-plug-in-with-ggml-backend): supported on `Ubuntu above 20.04` (x86_64), macOS (M1 and M2), and GPU (NVIDIA).
 - [PyTorch backend](#wasi-nn-plug-in-with-pytorch-backend): supported on `Ubuntu above 20.04` and `manylinux2014_x86_64`.
 - [OpenVINO™ backend](#wasi-nn-plug-in-with-openvino-backend): supported on `Ubuntu above 20.04`.
 - [TensorFlow-Lite backend](#wasi-nn-plug-in-with-tensorflow-lite-backend): supported on `Ubuntu above 20.04`, `manylinux2014_x86_64`, and `manylinux2014_aarch64`.

 Noticed that the backends are exclusive. Developers can only choose and install one backend for the `WASI-NN` plug-in.

+#### WASI-NN plug-in with ggml backend
+
+The `WASI-NN` plug-in with the `ggml` backend allows WasmEdge to run llama2 inference. To install WasmEdge with the `WASI-NN ggml backend` plug-in, please use the `--plugins wasi_nn-ggml` parameter when running the installer command.
+
+Please note that the installer from WasmEdge 0.13.5 detects CUDA automatically. If CUDA is detected, the installer will always attempt to install a CUDA-enabled version of the plug-in.
+
+If the CPU is the only available hardware on your machine, the installer will install the OpenBLAS version of the plug-in instead. In that case, install the OpenBLAS development package first.
+
+```bash
+apt update && apt install -y libopenblas-dev # You may need sudo if the user is not root.
+```
+
+Then, go to the [Llama2 inference in Rust chapter](../develop/rust/wasinn/llm-inference) to see how to run AI inference with the llama2 series of models.
+
 #### WASI-NN plug-in with PyTorch backend

 `WASI-NN` plug-in with `PyTorch` backend allows WasmEdge applications to perform `PyTorch` model inference. To install WasmEdge with `WASI-NN PyTorch backend` plug-in on Linux, please use the `--plugins wasi_nn-pytorch` parameter when [running the installer command](#generic-linux-and-macos).
