An Android application that brings a large language model (LLM) to your phone — fully offline, no internet needed. Powered by ONNX Runtime and a Hugging Face-compatible tokenizer, it provides fast, private, on-device question answering with streaming responses.
- 📱 Fully on-device LLM inference with ONNX Runtime (sketched below)
- 🔤 Hugging Face-compatible BPE tokenizer (`tokenizer.json`)
- 🧠 Qwen2.5 & Qwen3 prompt formatting with streaming generation
- 🧩 Custom `ModelConfig` for precision, prompt style, and KV cache
- 🧘‍♂️ Thinking Mode toggle (enabled in Qwen3) for step-by-step reasoning
- 🚀 Coroutine-based UI for smooth user experience
- 🔐 Runs 100% offline — no network, no telemetry
Figure: App interface showing prompt input and generated answers using the local LLM.
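To give a feel for how streaming generation and the coroutine-based UI fit together, here is a minimal sketch. It assumes a single-token decoding step per ONNX forward pass; the names (`LlmEngine`, `generateStream`, and the constructor parameters) are illustrative, not the repo's actual API.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn

// Hypothetical wrapper around the ONNX Runtime session and the BPE tokenizer.
class LlmEngine(
    private val tokenize: (String) -> LongArray,   // encode the prompt via tokenizer.json
    private val decodeToken: (Long) -> String,     // token id -> text piece
    private val nextToken: (LongArray) -> Long,    // one ONNX forward pass + sampling
    private val eosTokenId: Long,
) {
    // Emits decoded text pieces one at a time so the UI can render them as they arrive.
    fun generateStream(prompt: String, maxNewTokens: Int = 256): Flow<String> = flow {
        var ids = tokenize(prompt)
        repeat(maxNewTokens) {
            val next = nextToken(ids)
            if (next == eosTokenId) return@flow    // stop at end-of-sequence
            ids += next                            // append and keep decoding
            emit(decodeToken(next))
        }
    }.flowOn(Dispatchers.Default)                  // keep inference off the main thread
}
```

A UI layer can then `collect` this flow inside a coroutine scope and append each emitted piece to the displayed answer, which is what produces the streaming effect.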
This app supports both Qwen2.5-0.5B-Instruct and Qwen3-0.6B — optimized for instruction-following, QA, and reasoning tasks.
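Both Qwen generations use the ChatML prompt format. Below is a minimal sketch of that formatting, assuming the common Qwen3 convention that disabling thinking is done by pre-filling an empty `<think></think>` block in the assistant turn; the app's actual `ModelConfig`-driven formatting may differ.

```kotlin
// Builds a ChatML-style prompt as used by Qwen2.5/Qwen3 instruct models.
fun buildPrompt(
    userMessage: String,
    systemPrompt: String = "You are a helpful assistant.",
    enableThinking: Boolean = false,
): String = buildString {
    append("<|im_start|>system\n").append(systemPrompt).append("<|im_end|>\n")
    append("<|im_start|>user\n").append(userMessage).append("<|im_end|>\n")
    append("<|im_start|>assistant\n")
    // Qwen3 only: an empty think block suppresses step-by-step reasoning output.
    if (!enableThinking) append("<think>\n\n</think>\n\n")
}
```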
Download the `model.onnx` and `tokenizer.json` files from Hugging Face:
```bash
pip install "optimum[onnxruntime]"
# or install the latest version from source
python -m pip install git+https://github.com/huggingface/optimum.git
```
Export the model:
```bash
optimum-cli export onnx --model Qwen/Qwen2.5-0.5B-Instruct qwen2.5-0.5B-onnx/
```
- You can also convert any fine-tuned variant by passing its local path or Hub ID to `--model`.
- Learn more in the [Optimum documentation](https://huggingface.co/docs/optimum).
- Android Studio
- ONNX Runtime for Android (already included in this repo)
- A physical Android device for deployment and testing
- Open Android Studio and create a new project (Empty Activity).
- Name your app `local_llm`.
- Copy all the project files from this repo into the appropriate folders.
- Place your `model.onnx` and `tokenizer.json` in `app/src/main/assets/` (they are loaded at runtime, as sketched after these steps).
- Connect your Android phone using wireless debugging or USB.
- To install, either:
  - Press Run ▶️ in Android Studio, or
  - Go to Build → Generate Signed Bundle / APK to export the `.apk` file.
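Once the files are in `assets/`, they can be loaded at runtime with the ONNX Runtime Java/Kotlin API, roughly as follows. This is a sketch: the function names are illustrative, and the repo's actual loading code may differ.

```kotlin
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession
import android.content.Context

// Reads model.onnx from the APK's assets and creates an inference session.
fun createSession(context: Context): OrtSession {
    val env = OrtEnvironment.getEnvironment()
    val modelBytes = context.assets.open("model.onnx").use { it.readBytes() }
    return env.createSession(modelBytes, OrtSession.SessionOptions())
}

// Reads tokenizer.json as text for the Hugging Face-compatible tokenizer.
fun loadTokenizerJson(context: Context): String =
    context.assets.open("tokenizer.json").bufferedReader().use { it.readText() }
```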
- ➡️ pocket_llm_qwen2.5_0.5B_v1.1.0.apk: Full precision (FP32). Best for high-end devices, with improved inference performance.
- ➡️ pocket_llm_qwen2.5_0.5B_fp16_v1.1.0.apk: Half precision (FP16). A great balance of speed and accuracy for most devices.
- ➡️ pocket_llm_qwen2.5_0.5B_q4fp16_v1.1.0.apk: Quantized Q4 + FP16. The fastest and lightest Qwen2.5 build.
- ➡️ pocket_llm_qwen3_0.6B_fp16_v1.1.0.apk: 🔥 New! Qwen3-0.6B with improved reasoning and Thinking Mode support.
- ➡️ pocket_llm_qwen3_0.6B_q4fp16_v1.1.0.apk: 🔥 New! Quantized Qwen3 (Q4 + FP16). Compact and fast, with Thinking Mode.
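The variants above differ mainly in precision and prompt style, which is what the custom `ModelConfig` captures. As an illustration only (the field names below are assumptions, not the actual class):

```kotlin
// Hypothetical per-variant configuration, loosely modeled on the repo's ModelConfig idea.
enum class Precision { FP32, FP16, Q4_FP16 }
enum class PromptStyle { QWEN2_5, QWEN3 }

data class ModelConfig(
    val precision: Precision,
    val promptStyle: PromptStyle,
    val useKvCache: Boolean = true,
    // Thinking Mode is only meaningful for Qwen3 builds.
    val supportsThinkingMode: Boolean = promptStyle == PromptStyle.QWEN3,
)

// e.g. the Q4 + FP16 Qwen3 APK would correspond to:
val qwen3Q4 = ModelConfig(Precision.Q4_FP16, PromptStyle.QWEN3)
```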
This app performs all inference locally on your device. No data is sent to any server, ensuring full privacy and low latency.
- 🧠 Qwen3-0.6B — Added Qwen3 model support.
- 🔁 Chat Memory — Add multi-turn conversation with context retention.
- 🦙 LLaMA 3 1B — Support Meta’s new compact LLM.
MIT License — use freely, modify locally, and deploy offline. Contributions welcome!