
feat: implement gemma3n text model in MLXLLM #346


Open · wants to merge 3 commits into main

Conversation


@xlab commented Jul 2, 2025

Implementation of Gemma 3n model for MLXLLM, text only. Based on the reference implementation in mlx-lm:
ml-explore/mlx-lm#258

This code may also help with building the VLM version in #340.
cc @DePasqualeOrg

Models

The original MLX weights from mlx-vlm are not supported; only weights converted by mlx-lm are supported.

I've made a new collection of text-only MLX models (bf16 and 4-bit quantized) using this new support:
https://huggingface.co/collections/mlx-community/gemma-3n-text-only-lm-6861cf66ddc9a13102996308
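
A minimal usage sketch, assuming the existing LLMModelFactory / MLXLMCommon API in this repo; the prompt and the 240-token cap are illustrative:

```swift
import MLXLLM
import MLXLMCommon

// Load one of the converted text-only models from the collection above.
let container = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelConfiguration(id: "mlx-community/gemma-3n-E2B-it-lm-4bit"))

// Prepare the prompt and generate up to 240 tokens.
let result = try await container.perform { context in
    let input = try await context.processor.prepare(
        input: UserInput(prompt: "Why is the sky blue?"))
    return try MLXLMCommon.generate(
        input: input, parameters: GenerateParameters(), context: context
    ) { tokens in
        tokens.count >= 240 ? .stop : .more
    }
}
print(result.output)
```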

Naive benchmarks

Apple M4 Max

| Model | Peak Memory | Generation Speed | Generation Time |
| --- | --- | --- | --- |
| mlx-community/gemma-3n-E4B-it-lm-bf16 | 13097M | 39.98 tokens/s | 7.50s |
| mlx-community/gemma-3n-E2B-it-lm-bf16 | 8535M | 63.44 tokens/s | 4.73s |
| mlx-community/gemma-3n-E4B-it-lm-4bit | 3684M | 81.05 tokens/s | 3.70s |
| mlx-community/gemma-3n-E2B-it-lm-4bit | 2391M | 113.44 tokens/s | 2.64s |

iPhone 16 Pro

| Model | Generation Speed |
| --- | --- |
| mlx-community/gemma-3n-E4B-it-lm-4bit | 10-15 tokens/s |
| mlx-community/gemma-3n-E2B-it-lm-4bit | 25-30 tokens/s |
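
For context, a rough sketch of how numbers like these can be collected, continuing the loading sketch above; `GPU.snapshot().peakMemory` and the `tokensPerSecond` / `generateTime` fields on `GenerateResult` are my assumptions about the current MLX / MLXLMCommon APIs:

```swift
import MLX
import MLXLMCommon

// Inside container.perform { context in ... } from the sketch above:
// GenerateResult carries the timing, and GPU.snapshot() reports the peak
// of Metal allocations, matching the Peak Memory column.
let result = try MLXLMCommon.generate(
    input: input, parameters: GenerateParameters(), context: context
) { tokens in
    tokens.count >= 256 ? .stop : .more
}
print("Peak memory: \(GPU.snapshot().peakMemory / (1024 * 1024))M")
print("Speed: \(result.tokensPerSecond) tokens/s, time: \(result.generateTime)s")
```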

Notes

  • Some operations (e.g. gelu_topk, logit_softcap) can be compiled to improve performance; see the sketch below
  • RMSNoScale can be improved once MLXFast.rmsNorm is fixed to allow nil weights
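
To make these two notes concrete, a rough sketch of both ideas; the softcap value and eps are illustrative, and the unary `compile` overload is assumed from MLX Swift:

```swift
import MLX

// Logit soft-capping is a small element-wise graph, so wrapping it in
// compile lets MLX fuse it into fewer kernels.
let softcap: Float = 30.0  // illustrative value
let logitSoftcap = compile { (logits: MLXArray) in
    tanh(logits / softcap) * softcap
}

// RMSNoScale as in the second note: RMS normalization without a learned
// weight, usable until MLXFast.rmsNorm accepts nil weights.
func rmsNoScale(_ x: MLXArray, eps: Float = 1e-6) -> MLXArray {
    x * rsqrt(mean(x * x, axis: -1, keepDims: true) + eps)
}
```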

Misc

  • added to LLMModelFactory (see the registration sketch below)
  • added to MLXService
  • added to MLXChatExample
  • added 4 model references from HF
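
A rough sketch of what the LLMModelFactory registration amounts to; `register(configurations:)` on the factory's model registry is my assumption about its public API, and the id is one of the four references added here:

```swift
import MLXLLM
import MLXLMCommon

// The factory keeps a registry of ModelConfiguration values keyed by
// Hugging Face id, so new references can also be registered at runtime.
let gemma3n = ModelConfiguration(
    id: "mlx-community/gemma-3n-E4B-it-lm-4bit",
    defaultPrompt: "Why is the sky blue?")
LLMModelFactory.shared.modelRegistry.register(configurations: [gemma3n])
```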

Demos

ios-15toks (demo video)
macos-60toks (demo video)

xlab and others added 2 commits July 2, 2025 04:01
@DePasqualeOrg (Contributor) commented:

Nice! Did you base this on #340, or did you start from scratch based on the Python implementation?

@xlab (Author) commented Jul 2, 2025:

Hey, good question. This implementation is my third attempt, and it was made from scratch based on the Python source from mlx-lm.

#340 was a great inspiration, since I am new to this, but at times it was misleading. In my initial attempt I also used the mlx-vlm language model, but it wasn't a good reference either. It all worked out once the mlx-lm reference was ready.

The key to a successful transpilation is to prompt piece by piece and verify, and at the end to feed in this guide and re-verify the whole thing: https://swiftpackageindex.com/ml-explore/mlx-swift/main/documentation/mlx/converting-python
