Commit 2cf37a8

Update llm_inference.md
1 parent e053703 commit 2cf37a8

File tree

1 file changed (+9 -6 lines)


docs/develop/rust/wasinn/llm_inference.md

Lines changed: 9 additions & 6 deletions
@@ -6,7 +6,7 @@ sidebar_position: 1
 
 WasmEdge now supports running open-source Large Language Models (LLMs) in Rust. We will use [this example project](https://github.com/second-state/LlamaEdge/tree/main/chat) to show how to make AI inferences with the llama-3.1-8B model in WasmEdge and Rust.
 
-Basically, WasmEdge can support any open-source LLMs. Please check [the supported models](https://github.com/second-state/LlamaEdge/blob/main/models.md) for details.
+Furthermore, WasmEdge can support any open-source LLMs. Please check [the supported models](https://github.com/second-state/LlamaEdge/blob/main/models.md) for details.
 
 ## Prerequisite

@@ -31,7 +31,7 @@ curl -LO https://huggingface.co/second-state/Meta-Llama-3.1-8B-Instruct-GGUF/res
 Run the inference application in WasmEdge.
 
 ```bash
-wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf llama-chat.wasm -p llama-a-chat
+wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf llama-chat.wasm -p llama-3-chat
 ```
 
 After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[USER]:` prompt:
@@ -119,7 +119,10 @@ You can configure the chat inference application through CLI options.
 
 The `--prompt-template` option is perhaps the most interesting. It allows the application to support different open source LLM models beyond llama2. Check out more prompt templates [here](https://github.com/LlamaEdge/LlamaEdge/tree/main/api-server/chat-prompts).
 
-Furthermore, the following command tells WasmEdge to print out logs and statistics of the model at runtime.
+The `--ctx-size` option specifies the context window size of the application. It is limited by the model's intrinsic context window size. If you increase the `--ctx-size`, make sure that you also
+explicitly set `--batch-size` to a reasonable value (e.g., `--batch-size 512`).
+
+The following command tells WasmEdge to print out logs and statistics of the model at runtime.
 
 ```bash
 wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
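# A sketch of the `--ctx-size` guidance added above (the 4096 and 512 values are
# illustrative assumptions; adjust them to your model and hardware):
#
#   wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
#     llama-chat.wasm -p llama-3-chat --ctx-size 4096 --batch-size 512
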
@@ -147,12 +150,12 @@ You can make the inference program run faster by AOT compiling the wasm file fir
 
 ```bash
 wasmedge compile llama-chat.wasm llama-chat.wasm
-wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf llama-chat.wasm
+wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf llama-chat.wasm -p llama-3-chat
 ```
 
 ## Understand the code
 
-The [main.rs](https://github.com/second-state/llamaedge/blob/main/chat/src/main.rs) is the full Rust code to create an interactive chatbot using a LLM. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2 and other model’s chat templates, and runs the inference operations using the WASI NN standard API. The code logic for the chat interaction is somewhat complex. In this section, we will use the [simple example](https://github.com/second-state/llamaedge/tree/main/simple) to explain how to set up and perform one inference round trip. Here is how you use the simple example.
+The [main.rs](https://github.com/second-state/llamaedge/blob/main/chat/src/main.rs) is the full Rust code to create an interactive chatbot using a LLM. The Rust program manages the user input, tracks the conversation history, transforms the text into the model’s chat templates, and runs the inference operations using the WASI NN standard API. The code logic for the chat interaction is somewhat complex. In this section, we will use the [simple example](https://github.com/second-state/llamaedge/tree/main/simple) to explain how to set up and perform one inference round trip. Here is how you use the simple example.
 
 ```bash
 # Download the compiled simple inference wasm
@@ -269,6 +272,6 @@ println!("\noutput: {}", output);
 
 ## Resources
 
-* If you're looking for multi-turn conversations with llama 2 models, please check out the above mentioned chat example source code [here](https://github.com/second-state/llamaedge/tree/main/chat).
+* If you're looking for multi-turn conversations with llama models, please check out the above mentioned chat example source code [here](https://github.com/second-state/llamaedge/tree/main/chat).
 * If you want to construct OpenAI-compatible APIs specifically for your llama2 model, or the Llama2 model itself, please check out the source code [for the API server](https://github.com/second-state/llamaedge/tree/main/api-server).
 * To learn more, please check out [this article](https://medium.com/stackademic/fast-and-portable-llama2-inference-on-the-heterogeneous-edge-a62508e82359).
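
For a rough picture of the single inference round trip that the `Understand the code` section above walks through (ending in the `println!` shown in the last hunk), here is a minimal Rust sketch modeled on the linked simple example. It assumes the `wasmedge_wasi_nn` crate; the `"default"` model name matches the `--nn-preload default:...` flag used in the commands above, while the prompt text and output buffer size are illustrative.

```rust
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // Load the GGUF model that WasmEdge preloaded under the name "default"
    // (see the --nn-preload flag in the wasmedge commands above).
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to load the preloaded model");

    // One execution context per inference round trip.
    let mut ctx = graph
        .init_execution_context()
        .expect("failed to create an execution context");

    // The prompt goes in as a UTF-8 byte tensor at input index 0.
    let prompt = "Question: What is the capital of France?\nAnswer:";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
        .expect("failed to set the input tensor");

    // Run the inference.
    ctx.compute().expect("inference failed");

    // Read the generated text back from output index 0.
    let mut out_buf = vec![0u8; 4096];
    let out_len = ctx.get_output(0, &mut out_buf).expect("failed to read the output");
    let output = String::from_utf8_lossy(&out_buf[..out_len.min(out_buf.len())]);
    println!("\noutput: {}", output);
}
```

Compiled for the `wasm32-wasi` target, a program like this would be run the same way as the simple example wasm above, with the model supplied through `--nn-preload`.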
