A demonstration of running native Large Language Models (LLMs) on Android devices. Currently supported models include:
- Qwen3: 0.6B, 1.7B, 4B...
- Qwen2.5-Instruct: 0.5B, 1.5B, 3B...
- Qwen2.5VL: 3B
- DeepSeek-R1-Distill-Qwen: 1.5B
- MiniCPM-DPO/SFT: 1B, 2.7B
- Gemma-3-it: 1B, 4B...
- Phi-4-mini-Instruct: 3.8B
- Llama-3.2-Instruct: 1B
- InternVL-Mono: 2B
- InternLM-3: 8B
- 2025/04/29: Updated Qwen3.
- 2025/04/05: Updated Qwen2.5 and InternVL-Mono to `q4f32` + `dynamic_axes`.
- 2025/02/22: Added a low-memory loading mode for `Qwen`, `QwenVL`, and `MiniCPM_2B_single`; set `low_memory_mode = true` in `MainActivity.java`.
- 2025/02/07: Added DeepSeek-R1-Distill-Qwen 1.5B (please use the Qwen v2.5 `Qwen_Export.py`).
- Download Models:
  - Quick Try: Qwen3-1.7B-Android
- Setup Instructions:
  - Place the downloaded model files into the `assets` folder.
  - Decompress the `*.so` files stored in the `libs/arm64-v8a` folder.
- Model Notes:
  - Demo models are converted from HuggingFace or ModelScope and optimized for extreme execution speed.
  - Inputs and outputs may differ slightly from the original models.
  - For Qwen2VL / Qwen2.5VL, adjust the key variables to match the model parameters:
    - `GLRender.java`: lines 37, 38, 39
    - `project.h`: lines 14, 15, 16, 35, 36, 41, 59, 60
- ONNX Export Considerations:
  - It is recommended to use dynamic axes and `q4f32` quantization (see the export sketch after this list).
  - The `tokenizer.cpp` and `tokenizer.hpp` files are sourced from the mnn-llm repository.
  - Navigate to the `Export_ONNX` folder.
  - Follow the comments in the Python scripts to set the folder paths.
  - Execute the `***_Export.py` script to export the model.
  - Quantize or optimize the ONNX model manually (a generic quantization sketch follows this list).
  - Use `onnxruntime.tools.convert_onnx_models_to_ort` to convert models to `*.ort` format (see the conversion sketch below). Note that this process automatically adds `Cast` operators that change FP16 multiplication to FP32.
  - The quantization methods are detailed in the `Do_Quantize` folder.
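For context, here is a minimal sketch of what a dynamic-axes ONNX export looks like. The toy stand-in model, file name, and axis names are assumptions made for illustration; the real graph inputs, KV-cache handling, and shapes are defined by the repo's `Export_ONNX/***_Export.py` scripts.

```python
# Minimal dynamic-axes export sketch (hypothetical stand-in model; the real
# export logic lives in the Export_ONNX/*_Export.py scripts).
import torch

class TinyLM(torch.nn.Module):
    """Toy decoder stand-in: embedding + LM head only."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab, bias=False)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))

model = TinyLM().eval()
dummy_ids = torch.ones((1, 8), dtype=torch.int64)  # (batch, seq_len)

torch.onnx.export(
    model,
    (dummy_ids,),
    "llm_block.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={                     # keep the sequence length symbolic
        "input_ids": {1: "seq_len"},
        "logits": {1: "seq_len"},
    },
    opset_version=17,
)
```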
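The exact `q4f32` recipe used for the demo models is the one in `Do_Quantize`. As a rough illustration only, the sketch below applies onnxruntime's generic `MatMul4BitsQuantizer` (4-bit weights, FP32 activations) to the file exported above; the file names and block size are assumptions.

```python
# Generic 4-bit weight quantization sketch (not the exact Do_Quantize recipe).
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("llm_block.onnx")                  # file from the export sketch
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()                                  # rewrite MatMul weights to 4-bit
quantizer.model.save_model_to_file(
    "llm_block_q4.onnx",
    use_external_data_format=True,                   # keeps large models loadable
)
```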
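The ORT conversion is a one-shot step using the converter bundled with the `onnxruntime` Python package; the sketch below simply invokes it on the (hypothetical) quantized file from the previous step.

```python
# Convert *.onnx to *.ort with the converter shipped in onnxruntime.
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "onnxruntime.tools.convert_onnx_models_to_ort",
     "llm_block_q4.onnx"],
    check=True,
)
# The converter writes the *.ort file (plus an operator config) next to the
# input; inspect the resulting graph if the auto-inserted Cast (FP16 -> FP32)
# operators matter for your model.
```

The resulting `*.ort` file is what goes into the `assets` folder in the setup step above.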
- Explore more projects: DakeQQ Projects
OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 13 | Nubia Z50 | 8_Gen2-CPU | Qwen-2-1.5B-Instruct q8f32 | 20 token/s |
Android 15 | Vivo x200 Pro | MediaTek_9400-CPU | Qwen-3-1.7B-Instruct q4f32 dynamic | 37 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen-3-1.7B-Instruct q4f32 dynamic | 18.5 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen-2.5-1.5B-Instruct q4f32 dynamic | 20.5 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen-2-1.5B-Instruct q8f32 | 13 token/s |
Harmony 3 | Honor 20S | Kirin_810-CPU | Qwen-2-1.5B-Instruct q8f32 | 7 token/s |

OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 13 | Nubia Z50 | 8_Gen2-CPU | QwenVL-2-2B q8f32 | 15 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | QwenVL-2-2B q8f32 | 9 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | QwenVL-2.5-3B q4f32 dynamic | 9 token/s |

OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 13 | Nubia Z50 | 8_Gen2-CPU | Distill-Qwen-1.5B q4f32 dynamic | 34.5 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | Distill-Qwen-1.5B q4f32 dynamic | 20.5 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | Distill-Qwen-1.5B q8f32 | 13 token/s |
HyperOS 2 | Xiaomi-14T-Pro | MediaTek_9300+-CPU | Distill-Qwen-1.5B q8f32 | 22 token/s |

OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 15 | Nubia Z50 | 8_Gen2-CPU | MiniCPM4-0.5B q4f32 | 78 token/s |
Android 13 | Nubia Z50 | 8_Gen2-CPU | MiniCPM-2.7B q8f32 | 9.5 token/s |
Android 13 | Nubia Z50 | 8_Gen2-CPU | MiniCPM-1.3B q8f32 | 16.5 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | MiniCPM-2.7B q8f32 | 6 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | MiniCPM-1.3B q8f32 | 11 token/s |

OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 13 | Nubia Z50 | 8_Gen2-CPU | Gemma-1.1-it-2B q8f32 | 16 token/s |

OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 13 | Nubia Z50 | 8_Gen2-CPU | Phi-2-2B-Orange-V2 q8f32 | 9.5 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | Phi-2-2B-Orange-V2 q8f32 | 5.8 token/s |

OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 13 | Nubia Z50 | 8_Gen2-CPU | Llama-3.2-1B-Instruct q8f32 | 25 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | Llama-3.2-1B-Instruct q8f32 | 16 token/s |

OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Harmony 4 | P40 | Kirin_990_5G-CPU | Mono-2B-S1-3 q4f32 dynamic | 10.5 token/s |