Port of Ernie4 5 #348

smdesai · 2025-07-03T18:04:30Z

This is an MLX Swift port of @johnmai-dev's port of Ernie. I can't run the model with mlx-community/ERNIE-4.5-0.3B-PT-bf16 because it's missing tokenizer.json, so I've created a copy of the model at smdesai/ERNIE-4.5-0.3B-PT-bf16 that contains the missing tokenizer.json file.

johnmai-dev · 2025-07-04T00:28:41Z

Hi! Thanks for creating the updated model copy — really appreciate it!

Could I ask how you managed to add the missing tokenizer.json file to smdesai/ERNIE-4.5-0.3B-PT-bf16?

Would love to learn from your process — thanks in advance!

johnmai-dev · 2025-07-04T04:34:17Z

Applications/LLMEval/ContentView.swift

@@ -176,7 +176,7 @@ class LLMEvaluator {

    /// This controls which model loads. `qwen2_5_1_5b` is one of the smaller ones, so this will fit on
    /// more devices.
-    let modelConfiguration = LLMRegistry.qwen3_1_7b_4bit
+    let modelConfiguration = LLMRegistry.ernie4503BPTbf16


It is recommended not to modify this line; see #302 (comment).

smdesai · Jul 4, 2025

@johnmai-dev Thanks for the heads-up on the recommendation; I've reverted it. As for generating tokenizer.json, this is the Python script I used. I had already downloaded the model, so I used the last example.

from transformers import AutoTokenizer
import json
import os


def convert_tokenizer_model_to_json(model_path, output_path=None):
    """
    Convert a tokenizer.model file to tokenizer.json format.

    Args:
        model_path: Path to the tokenizer.model file or directory containing it
        output_path: Optional output path for tokenizer.json (defaults to same directory)
    """
    # Handle both file and directory paths
    if os.path.isdir(model_path):
        tokenizer_model_path = os.path.join(model_path, "tokenizer.model")
    else:
        tokenizer_model_path = model_path
        model_path = os.path.dirname(model_path)

    if not os.path.exists(tokenizer_model_path):
        raise FileNotFoundError(f"tokenizer.model not found at {tokenizer_model_path}")

    tokenizer = AutoTokenizer.from_pretrained(model_path)

    if output_path is None:
        output_path = model_path

    tokenizer.save_pretrained(output_path)

    tokenizer_json_path = os.path.join(output_path, "tokenizer.json")
    if os.path.exists(tokenizer_json_path):
        print(f"Successfully created tokenizer.json at {tokenizer_json_path}")
    else:
        print("Warning: tokenizer.json was not created. The tokenizer might not support this format.")

    return tokenizer_json_path


# Example usage
if __name__ == "__main__":
    # Example: Convert a tokenizer.model file
    # convert_tokenizer_model_to_json("/path/to/tokenizer.model")

    # Example: Convert from a directory containing tokenizer.model
    # (placeholder path; replace it with the local model directory)
    convert_tokenizer_model_to_json("/path/to/model/directory")

johnmai-dev · Jul 5, 2025

Thank you very much! @smdesai

smdesai · Jul 6, 2025

Hey @johnmai-dev, I see the ERNIE model in mlx-community was added by you. Any chance you can add tokenizer.json to the models here? I can then change LLMModelFactory to reference the model in mlx-community.

johnmai-dev · Jul 6, 2025

I used the Python script you provided, but it didn't generate tokenizer.json. Maybe I need to configure something else?

smdesai · Jul 8, 2025

Sorry, try this: https://colab.research.google.com/drive/1B9v_838cTn0KavQ26uWuFcRKJ7jrjVQb?usp=sharing

johnmai-dev · Jul 8, 2025

Where can I download the ERNIE-4.5-0.3B-PT-bf16 files used in this notebook?

smdesai · Jul 8, 2025

The files are identical to the ones here: https://huggingface.co/mlx-community/ERNIE-4.5-0.3B-PT-bf16

johnmai-dev · Jul 8, 2025

Unfortunately, tokenizer.json still isn't generated.
I have added huggingface-cli download and pip install commands to your notebook.

Can you try running it with the notebook I provided?
Let me see whether the results differ when you run it with my notebook.
https://colab.research.google.com/drive/1fAHK6EL8JYsHDo5duJr5llI1YeeK974t?usp=sharing

smdesai · Jul 8, 2025

OK, I have no idea what's going on here. I tried your notebook and I get the same error as you. I also tried the same changes in my notebook and got the same error (not surprising). So the only things that work are:

- downloading the model files via huggingface to a local directory and converting
- uploading the model files to Colab and performing the conversion
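Whichever of the two routes is used, it can help to confirm which tokenizer files are actually present in the local model directory before and after converting. The following is a minimal sketch, not part of the PR; the helper names and the `TOKENIZER_FILES` tuple are assumptions made for illustration, and only the three file names come from this thread.

```python
from pathlib import Path

# File names mentioned in this thread; extend the tuple if your model ships others.
TOKENIZER_FILES = ("tokenizer.model", "tokenizer.json", "tokenizer_config.json")


def tokenizer_file_status(model_dir):
    """Return {file_name: True/False} for each expected tokenizer file in model_dir."""
    root = Path(model_dir)
    return {name: (root / name).is_file() for name in TOKENIZER_FILES}


def missing_tokenizer_files(model_dir):
    """List the expected tokenizer files that are absent from model_dir."""
    status = tokenizer_file_status(model_dir)
    return [name for name, present in status.items() if not present]
```

Calling `missing_tokenizer_files("models/ERNIE-4.5-0.3B-PT-bf16")` on a fresh download would be expected to include `tokenizer.json` if the conversion has not succeeded yet.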

johnmai-dev · 2025-07-08T17:21:42Z

The tokenizer_config.json you are using is inconsistent with Baidu's. Since you are using LlamaTokenizer, your model can generate tokenizer.json.

https://huggingface.co/smdesai/ERNIE-4.5-0.3B-PT-bf16/blob/main/tokenizer_config.json#L9242

https://huggingface.co/baidu/ERNIE-4.5-0.3B-PT/blob/main/tokenizer_config.json#L14
https://huggingface.co/mlx-community/ERNIE-4.5-0.3B-PT-bf16/blob/main/tokenizer_config.json#L9155
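A quick way to spot this kind of mismatch locally is to compare the `tokenizer_class` field across downloaded copies of tokenizer_config.json. This is a rough sketch under the assumption that the linked lines point at that field; the function names and the labels passed in are hypothetical.

```python
import json


def tokenizer_class(config_path):
    """Return the `tokenizer_class` value from a tokenizer_config.json, or None if absent."""
    with open(config_path, encoding="utf-8") as f:
        return json.load(f).get("tokenizer_class")


def diff_tokenizer_classes(configs):
    """Compare tokenizer_class across several config files.

    `configs` maps an arbitrary label (chosen by the caller) to a file path.
    Returns ({label: tokenizer_class}, all_agree).
    """
    classes = {label: tokenizer_class(path) for label, path in configs.items()}
    all_agree = len(set(classes.values())) <= 1
    return classes, all_agree
```

For example, passing `{"baidu": "baidu/tokenizer_config.json", "copy": "smdesai/tokenizer_config.json"}` (local paths are placeholders) returns each file's class and `False` when they differ.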

smdesai · 2025-07-08T18:21:10Z

Thanks for tracking it down. It seems that when I was initially trying to convert, I did use LlamaTokenizer, which reported an error but may have incorrectly modified tokenizer_config.json. I then switched to AutoTokenizer, which performed the conversion (incorrectly). I'm going to look at using tokenization_ernie4_5.py to generate tokenizer.json.

smdesai · 2025-07-08T21:38:17Z

@johnmai-dev Try this. It keeps the rest of the model files intact and creates only tokenizer.json. When running in Colab, use this for main(). I also made changes to Tokenizer.swift in MLXCommon to support the T5Tokenizer (Unigram) for Ernie.

def main():
    convert_sentencepiece_to_tokenizer_json("models/ERNIE-4.5-0.3B-PT-bf16/tokenizer.model", "models/ERNIE-4.5-0.3B-PT-bf16/tokenizer.json")

convert_tokenizer.py.txt
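The attached script itself is not reproduced here. As a sanity check on its output, one could inspect which model type the generated tokenizer.json declares (for instance, whether it reports Unigram). The sketch below assumes the usual Hugging Face tokenizers layout with a top-level `model` object carrying `type` and `vocab`; the function name is hypothetical.

```python
import json


def tokenizer_json_summary(path):
    """Summarize a tokenizer.json: declared model type and vocabulary size.

    Assumes a top-level "model" object with "type" and "vocab" keys;
    missing keys yield None and 0 respectively.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    model = data.get("model", {})
    vocab = model.get("vocab", [])  # a list for Unigram, a dict for BPE; len() works for both
    return {"model_type": model.get("type"), "vocab_size": len(vocab)}
```

Running it on `models/ERNIE-4.5-0.3B-PT-bf16/tokenizer.json` after the conversion shows at a glance whether the file is in the expected format.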

Port of Ernie4 5

f3e2be1

johnmai-dev reviewed Jul 4, 2025

cleanup, revert to original mode and run pre-commit

70361d6

add Ernie tokenizer and use AutoTokenizer instead of PreTrainedTokenizer

5e12ad6
