LLM inference on an Apple Silicon Mac using the Apple MLX framework.
- Apple MacBook Pro (13-inch, M2, 2022)
- Apple M2 chip (8-core CPU, 10-core GPU)
- 16GB RAM, 256GB SSD
- macOS Sequoia 15.3.1
- Python 3.10.16
- mlx-lm 0.21.4
python3.10 -m venv .venv
source .venv/bin/activate
pip install -U pip setuptools pip-autoremove
pip install -r requirements.txt
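With the environment ready, you can sanity-check the install directly from the mlx-lm Python API before using the scripts below. A minimal sketch (the model ID is one of the MLX-community models used later in this README):

```python
# Quick sanity check: load a model with mlx-lm and generate a short reply.
from mlx_lm import load, generate

# load() fetches the model from the Hugging Face Hub on first use
# and returns the model together with its tokenizer.
model, tokenizer = load("mlx-community/gemma-2-9b-it-4bit")

# generate() runs a single (non-streaming) completion.
response = generate(model, tokenizer, prompt="Hello, MLX!", max_tokens=64)
print(response)
```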
Args | Type | Required | Default | Description |
---|---|---|---|---|
--repo_id | str | Required | | Path or Hugging Face repository ID |
--token | str | Optional | | Hugging Face API token |
--cache_dir | str | Optional | ~/.cache/huggingface/hub | Cache directory for the model |
source .venv/bin/activate
# Download the model from the Hugging Face Hub
python model_download.py --repo_id "mlx-community/gemma-2-9b-it-4bit"
# Download the model with a custom cache directory
python model_download.py --repo_id "mlx-community/gemma-2-9b-it-4bit" --cache_dir "/tmp/huggingface/hub"
# Download the model with a custom Hugging Face token
python model_download.py --repo_id "mlx-community/gemma-2-9b-it-4bit" --token "YOUR_HUGGING_FACE_API_TOKEN"
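model_download.py itself is not reproduced in this section. A minimal sketch of what such a script could look like, assuming it wraps huggingface_hub.snapshot_download (the argument names follow the table above; the real script may differ):

```python
# model_download.py -- hypothetical sketch, not the repository's actual script.
import argparse

from huggingface_hub import snapshot_download


def main() -> None:
    parser = argparse.ArgumentParser(description="Download a model from the Hugging Face Hub")
    parser.add_argument("--repo_id", type=str, required=True,
                        help="Path or Hugging Face repository ID")
    parser.add_argument("--token", type=str, default=None,
                        help="Hugging Face API token")
    parser.add_argument("--cache_dir", type=str, default=None,
                        help="Cache directory (defaults to ~/.cache/huggingface/hub)")
    args = parser.parse_args()

    # snapshot_download fetches every file in the repo and returns the local path.
    local_path = snapshot_download(repo_id=args.repo_id, token=args.token,
                                   cache_dir=args.cache_dir)
    print(f"Model downloaded to: {local_path}")


if __name__ == "__main__":
    main()
```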
Args | Type | Required | Default | Description |
---|---|---|---|---|
-m, --model | str | Optional | | Path or Hugging Face repository ID of the model |
--prompt | str | Optional | | Prompt for the LLM |
--max_tokens | int | Optional | 512 | Maximum number of tokens to generate |
--verbose | bool | Optional | | Verbose mode |
source .venv/bin/activate
# Run streaming inference with default values
python inference.py
# Run streaming inference in verbose mode
python inference.py --verbose
# Run streaming inference with a custom model
python inference.py --model "mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx"
# Run streaming inference with a custom prompt
python inference.py --prompt "What is the capital of France?"
# Run streaming inference with a custom max token count
python inference.py --max_tokens 1024
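inference.py is likewise not shown here. A minimal sketch of streaming generation with mlx-lm 0.21.x, whose stream_generate yields response objects with a .text field; the default model and prompt below are placeholders, not the script's actual defaults, and --verbose handling is omitted for brevity:

```python
# inference.py -- hypothetical sketch, not the repository's actual script.
import argparse

from mlx_lm import load, stream_generate


def main() -> None:
    parser = argparse.ArgumentParser(description="Streaming LLM inference with MLX")
    parser.add_argument("-m", "--model", type=str,
                        default="mlx-community/gemma-2-9b-it-4bit",  # placeholder default
                        help="Path or Hugging Face repository ID of the model")
    parser.add_argument("--prompt", type=str,
                        default="What is the capital of France?",  # placeholder default
                        help="Prompt for the LLM")
    parser.add_argument("--max_tokens", type=int, default=512,
                        help="Maximum number of tokens to generate")
    args = parser.parse_args()

    model, tokenizer = load(args.model)

    # Apply the model's chat template when it defines one.
    prompt = args.prompt
    if tokenizer.chat_template is not None:
        messages = [{"role": "user", "content": prompt}]
        prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

    # stream_generate yields partial responses; print each chunk as it arrives.
    for response in stream_generate(model, tokenizer, prompt, max_tokens=args.max_tokens):
        print(response.text, end="", flush=True)
    print()


if __name__ == "__main__":
    main()
```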
Args | Type | Required | Default | Description |
---|---|---|---|---|
-m, --model | str | Required | | Path to the model |
--quantize | bool | Optional | False | Whether to quantize the model |
--quantize_level | int | Optional | 4 | Quantization level (bits) |
--verbose | bool | Optional | | Verbose mode |
source .venv/bin/activate
# Convert a Hugging Face model to the MLX model format
python convert.py --model "google/gemma-2-9b-it"
# Convert in verbose mode
python convert.py --model "google/gemma-2-9b-it" --verbose
# Convert with quantization
python convert.py --model "google/gemma-2-9b-it" --quantize
# Convert with quantization at a custom quantization level (8-bit)
python convert.py --model "google/gemma-2-9b-it" --quantize --quantize_level 8
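convert.py is also not shown. A minimal sketch built on mlx_lm.convert, which accepts quantize and q_bits parameters; mapping --quantize_level onto q_bits is an assumption, and --verbose handling is omitted for brevity:

```python
# convert.py -- hypothetical sketch, not the repository's actual script.
import argparse

from mlx_lm import convert


def main() -> None:
    parser = argparse.ArgumentParser(description="Convert a Hugging Face model to MLX format")
    parser.add_argument("-m", "--model", type=str, required=True,
                        help="Path to the model")
    parser.add_argument("--quantize", action="store_true",
                        help="Whether to quantize the model")
    parser.add_argument("--quantize_level", type=int, default=4,
                        help="Quantization level (bits)")
    args = parser.parse_args()

    # convert() downloads the Hugging Face weights, rewrites them in MLX
    # format (to ./mlx_model by default), and optionally quantizes them.
    convert(args.model, quantize=args.quantize, q_bits=args.quantize_level)


if __name__ == "__main__":
    main()
```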