
Native-LLM-for-Android

Overview

A demonstration of running large language models (LLMs) natively on Android devices. Currently supported models include:

  • Qwen3: 0.6B, 1.7B, 4B...
  • Qwen2.5-Instruct: 0.5B, 1.5B, 3B...
  • Qwen2.5VL: 3B
  • DeepSeek-R1-Distill-Qwen: 1.5B
  • MiniCPM-DPO/SFT: 1B, 2.7B
  • Gemma-3-it: 1B, 4B...
  • Phi-4-mini-Instruct: 3.8B
  • Llama-3.2-Instruct: 1B
  • InternVL-Mono: 2B
  • InternLM-3: 8B

Recent Updates

  • 2025/04/29: Updated Qwen3.
  • 2025/04/05: Updated Qwen2.5 and InternVL-Mono to q4f32 + dynamic_axes.
  • 2025/02/22: Added a low-memory loading mode for Qwen, QwenVL, and MiniCPM_2B_single; set low_memory_mode = true in MainActivity.java to enable it.
  • 2025/02/07: Added DeepSeek-R1-Distill-Qwen 1.5B (use the Qwen v2.5 Qwen_Export.py for export).

Getting Started

  1. Download Models:

  2. Setup Instructions:

    • Place the downloaded model files into the assets folder.
    • Decompress the *.so files stored in the libs/arm64-v8a folder.
  3. Model Notes:

    • Demo models are converted from HuggingFace or ModelScope and aggressively optimized for execution speed.
    • As a result, their inputs and outputs may differ slightly from the original models.
    • For Qwen2VL / Qwen2.5VL, adjust the key variables to match the model parameters.
      • GLRender.java: Line 37, 38, 39
      • project.h: Line 14, 15, 16, 35, 36, 41, 59, 60
  4. ONNX Export Considerations:

    • Using dynamic axes together with q4f32 quantization is recommended; a minimal export sketch follows this list.
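
Below is a minimal, self-contained sketch of a dynamic-axes export. The TinyLM module, the file name llm.onnx, and the tensor/axis names are illustrative assumptions; the repository's actual export logic lives in the Export_ONNX scripts.

```python
import torch
import torch.nn as nn

# Stand-in model; the real graphs come from the Export_ONNX scripts.
class TinyLM(nn.Module):
    def __init__(self, vocab=32000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))

model = TinyLM().eval()
dummy_ids = torch.ones((1, 8), dtype=torch.int64)  # [batch, seq_len]

torch.onnx.export(
    model,
    (dummy_ids,),
    "llm.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    # Mark seq_len as dynamic so any prompt length works at runtime.
    dynamic_axes={"input_ids": {1: "seq_len"}, "logits": {1: "seq_len"}},
    opset_version=17,
)
```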

Tokenizer Files

Exporting Models

  1. Navigate to the Export_ONNX folder.
  2. Follow the comments in the Python scripts to set the folder paths.
  3. Execute the ***_Export.py script to export the model.
  4. Quantize or optimize the ONNX model manually; a quick way to sanity-check the exported graph first is sketched below.
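
Running the exported graph once with onnxruntime is a cheap sanity check before quantizing. The file and tensor names below follow the export sketch above and are assumptions, not the repository's actual naming.

```python
import numpy as np
import onnxruntime as ort

# Load the exported graph on CPU and run one forward pass.
sess = ort.InferenceSession("llm.onnx", providers=["CPUExecutionProvider"])
ids = np.array([[1, 2, 3, 4, 5]], dtype=np.int64)  # length 5: the seq axis is dynamic
(logits,) = sess.run(["logits"], {"input_ids": ids})
print(logits.shape)  # (1, 5, vocab)
```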

Quantization Notes

  • Use onnxruntime.tools.convert_onnx_models_to_ort to convert models to the *.ort format. Note that this process automatically inserts Cast operators that change FP16 multiplication to FP32.
  • The quantization methods are detailed in the Do_Quantize folder; a generic example follows this list.
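
As a generic illustration only (the repository's actual q4f32/q8f32 recipes are in the Do_Quantize folder), 8-bit dynamic weight quantization with onnxruntime looks roughly like this; the file names are assumptions:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize weights to 8-bit while keeping float activations.
quantize_dynamic(
    model_input="llm.onnx",
    model_output="llm_q8.onnx",
    weight_type=QuantType.QInt8,
)

# The *.ort conversion mentioned above is a separate CLI step, e.g.:
#   python -m onnxruntime.tools.convert_onnx_models_to_ort llm_q8.onnx
```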

Additional Resources

Performance Metrics

Qwen

| OS | Device | Backend | Model | Quantization | Inference (1024 Context) |
|---|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Qwen-2-1.5B-Instruct | q8f32 | 20 token/s |
| Android 15 | Vivo X200 Pro | MediaTek_9400-CPU | Qwen-3-1.7B-Instruct | q4f32 dynamic | 37 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen-3-1.7B-Instruct | q4f32 dynamic | 18.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen-2.5-1.5B-Instruct | q4f32 dynamic | 20.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen-2-1.5B-Instruct | q8f32 | 13 token/s |
| Harmony 3 | Honor 20S | Kirin_810-CPU | Qwen-2-1.5B-Instruct | q8f32 | 7 token/s |

QwenVL

| OS | Device | Backend | Model | Quantization | Inference (1024 Context) |
|---|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | QwenVL-2-2B | q8f32 | 15 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | QwenVL-2-2B | q8f32 | 9 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | QwenVL-2.5-3B | q4f32 dynamic | 9 token/s |

DeepSeek-R1

| OS | Device | Backend | Model | Quantization | Inference (1024 Context) |
|---|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Distill-Qwen-1.5B | q4f32 dynamic | 34.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Distill-Qwen-1.5B | q4f32 dynamic | 20.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Distill-Qwen-1.5B | q8f32 | 13 token/s |
| HyperOS 2 | Xiaomi-14T-Pro | MediaTek_9300+-CPU | Distill-Qwen-1.5B | q8f32 | 22 token/s |

MiniCPM

| OS | Device | Backend | Model | Quantization | Inference (1024 Context) |
|---|---|---|---|---|---|
| Android 15 | Nubia Z50 | 8_Gen2-CPU | MiniCPM4-0.5B | q4f32 | 78 token/s |
| Android 13 | Nubia Z50 | 8_Gen2-CPU | MiniCPM-2.7B | q8f32 | 9.5 token/s |
| Android 13 | Nubia Z50 | 8_Gen2-CPU | MiniCPM-1.3B | q8f32 | 16.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | MiniCPM-2.7B | q8f32 | 6 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | MiniCPM-1.3B | q8f32 | 11 token/s |

Gemma

| OS | Device | Backend | Model | Quantization | Inference (1024 Context) |
|---|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Gemma-1.1-it-2B | q8f32 | 16 token/s |

Phi

| OS | Device | Backend | Model | Quantization | Inference (1024 Context) |
|---|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Phi-2-2B-Orange-V2 | q8f32 | 9.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Phi-2-2B-Orange-V2 | q8f32 | 5.8 token/s |

Llama

| OS | Device | Backend | Model | Quantization | Inference (1024 Context) |
|---|---|---|---|---|---|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Llama-3.2-1B-Instruct | q8f32 | 25 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Llama-3.2-1B-Instruct | q8f32 | 16 token/s |

InternVL

| OS | Device | Backend | Model | Quantization | Inference (1024 Context) |
|---|---|---|---|---|---|
| Harmony 4 | P40 | Kirin_990_5G-CPU | Mono-2B-S1-3 | q4f32 dynamic | 10.5 token/s |

Demo Results

Qwen2VL-2B / 1024 Context

Demo Animation

Qwen2-1.5B / 1024 Context

Demo Animation

