## Summary
Qwen 3 is the latest iteration of the Qwen series of large language models (LLMs) developed by Alibaba. The edge-sized Qwen 3 variants (0.6B, 1.7B, and 4B) are currently supported.

## Instructions

Qwen 3 uses the same example code as our optimized Llama model; only the checkpoint, model params, and tokenizer differ. Please see the [Llama README page](../llama/README.md) for details.

All commands for exporting and running Llama on the various backends also apply to Qwen 3 after swapping in the following args:
```
--model [qwen3-0_6b,qwen3-1_7b,qwen3-4b]
--params [examples/models/qwen3/0_6b_config.json,examples/models/qwen3/1_7b_config.json,examples/models/qwen3/4b_config.json]
```

### Example export
Here is a basic example of exporting Qwen 3; please refer to the [Llama README page](../llama/README.md) for more advanced usage.

Export 0.6b to XNNPack, quantized with 8da4w:
```
python -m examples.models.llama.export_llama \
  --model qwen3-0_6b \
  --params examples/models/qwen3/0_6b_config.json \
  -kv \
  --use_sdpa_with_kv_cache \
  -d fp32 \
  -X \
  --xnnpack-extended-ops \
  -qmode 8da4w \
  --output_name="qwen3-0_6b.pte" \
  --verbose
```

Export 1.7b to XNNPack, quantized with 8da4w:
```
python -m examples.models.llama.export_llama \
  --model qwen3-1_7b \
  --params examples/models/qwen3/1_7b_config.json \
  -kv \
  --use_sdpa_with_kv_cache \
  -d fp32 \
  -X \
  --xnnpack-extended-ops \
  -qmode 8da4w \
  --output_name="qwen3-1_7b.pte" \
  --verbose
```

Export 4b to XNNPack, quantized with 8da4w:
```
python -m examples.models.llama.export_llama \
  --model qwen3-4b \
  --params examples/models/qwen3/4b_config.json \
  -kv \
  --use_sdpa_with_kv_cache \
  -d fp32 \
  -X \
  --xnnpack-extended-ops \
  -qmode 8da4w \
  --output_name="qwen3-4b.pte" \
  --verbose
```
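
Each export writes a self-contained `.pte` program into the current directory. As a quick sanity check, you can confirm the files were produced (file names taken from the `--output_name` flags above):
```
ls -lh qwen3-0_6b.pte qwen3-1_7b.pte qwen3-4b.pte
```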

### Example run
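The runner commands below load the tokenizer files from the local Hugging Face cache. If you have not downloaded the model yet, one way to fetch just the tokenizer files is with `huggingface-cli` (a sketch; note that the snapshot hash in the cache paths below will differ on your machine):
```
# Download only the tokenizer files for Qwen/Qwen3-0.6B into the local HF cache.
huggingface-cli download Qwen/Qwen3-0.6B tokenizer.json tokenizer_config.json
```
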
With ExecuTorch pybindings:
```
python -m examples.models.llama.runner.native \
  --model qwen3-0_6b \
  --pte qwen3-0_6b.pte \
  --tokenizer ~/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/a9c98e602b9d36d2a2f7ba1eb0f5f31e4e8e5143/tokenizer.json \
  --tokenizer_config ~/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/a9c98e602b9d36d2a2f7ba1eb0f5f31e4e8e5143/tokenizer_config.json \
  --prompt "Who is the president of the US?" \
  --params examples/models/qwen3/0_6b_config.json \
  --max_len 128 \
  -kv \
  --temperature 0.6
```

With ExecuTorch's sample C++ runner (see the Llama README's [Step 3: Run on your computer to validate](../llama/README.md#step-3-run-on-your-computer-to-validate) for how to build the runner):
```
cmake-out/examples/models/llama/llama_main \
  --model_path qwen3-0_6b.pte \
  --tokenizer_path ~/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/a9c98e602b9d36d2a2f7ba1eb0f5f31e4e8e5143/tokenizer.json \
  --prompt="Who is the president of the US?"
```

To run the model on an example iOS or Android app, see the Llama README's [Step 5: Build Mobile apps](../llama/README.md#step-5-build-mobile-apps) section.