Port of Ernie4 5 #348

Open: smdesai wants to merge 3 commits into main

@smdesai
Contributor

smdesai commented Jul 3, 2025

This is an MLX Swift port of @johnmai-dev's port of Ernie. I'm unable to run the model using mlx-community/ERNIE-4.5-0.3B-PT-bf16 because it's missing tokenizer.json. I've created a copy of the model at smdesai/ERNIE-4.5-0.3B-PT-bf16 that contains the missing tokenizer.json file.
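A quick way to confirm which tokenizer artifacts a local copy of a model actually contains is a small directory check. This is a minimal sketch using only the standard library; the function name `tokenizer_files_present`, the list of file names, and the local path are illustrative assumptions, not part of MLX Swift or transformers.

```python
from pathlib import Path

# File names commonly shipped alongside Hugging Face tokenizers
# (an assumed list, chosen for illustration).
TOKENIZER_FILES = ("tokenizer.json", "tokenizer.model", "tokenizer_config.json")


def tokenizer_files_present(model_dir):
    """Map each expected tokenizer file name to whether it exists in model_dir."""
    root = Path(model_dir)
    return {name: (root / name).is_file() for name in TOKENIZER_FILES}


if __name__ == "__main__":
    # Hypothetical local path; replace with wherever the model was downloaded.
    report = tokenizer_files_present("models/ERNIE-4.5-0.3B-PT-bf16")
    for name, present in report.items():
        print(f"{name}: {'found' if present else 'MISSING'}")
```

Run against a downloaded copy of the model, this makes it easy to see whether only `tokenizer.model` is present and `tokenizer.json` still needs to be generated.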

@johnmai-dev
Contributor

Hi! Thanks for creating the updated model copy — really appreciate it!

Could I ask how you managed to add the missing tokenizer.json file to smdesai/ERNIE-4.5-0.3B-PT-bf16?

Would love to learn from your process — thanks in advance!

@@ -176,7 +176,7 @@ class LLMEvaluator {

 /// This controls which model loads. `qwen2_5_1_5b` is one of the smaller ones, so this will fit on
 /// more devices.
-let modelConfiguration = LLMRegistry.qwen3_1_7b_4bit
+let modelConfiguration = LLMRegistry.ernie4503BPTbf16
Contributor

I'd recommend not changing the default model here. See #302 (comment).

Contributor Author

@johnmai-dev Thanks for the heads-up on the recommendation; I've reverted it. As for generating tokenizer.json, this is the Python script I used. I had downloaded the model beforehand, so I used the last example.

from transformers import AutoTokenizer
import json
import os

def convert_tokenizer_model_to_json(model_path, output_path=None):
    """
    Convert a tokenizer.model file to tokenizer.json format.

    Args:
        model_path: Path to the tokenizer.model file or directory containing it
        output_path: Optional output path for tokenizer.json (defaults to same directory)
    """
    # Handle both file and directory paths
    if os.path.isdir(model_path):
        tokenizer_model_path = os.path.join(model_path, "tokenizer.model")
    else:
        tokenizer_model_path = model_path
        model_path = os.path.dirname(model_path)

    if not os.path.exists(tokenizer_model_path):
        raise FileNotFoundError(f"tokenizer.model not found at {tokenizer_model_path}")

    tokenizer = AutoTokenizer.from_pretrained(model_path)

    if output_path is None:
        output_path = model_path

    tokenizer.save_pretrained(output_path)

    tokenizer_json_path = os.path.join(output_path, "tokenizer.json")
    if os.path.exists(tokenizer_json_path):
        print(f"Successfully created tokenizer.json at {tokenizer_json_path}")
    else:
        print("Warning: tokenizer.json was not created. The tokenizer might not support this format.")

    return tokenizer_json_path

# Example usage
if __name__ == "__main__":
    # Example: Convert a tokenizer.model file
    # convert_tokenizer_model_to_json("/path/to/tokenizer.model")

    # Example: Convert from a directory containing tokenizer.model
    # (placeholder path; point it at the downloaded model directory)
    convert_tokenizer_model_to_json("/path/to/model/directory")

Contributor

Thank you very much! @smdesai

Contributor Author

Hey @johnmai-dev, I see the ERNIE model in mlx-community was added by you. Any chance you can add tokenizer.json to the models here? I can then change LLMModelFactory to reference the model in mlx-community.

Contributor

I used the python script you provided, but it didn't generate tokenizer.json. Maybe I need to configure something else?
[Screenshot: Finder, 2025-07-06 13:58:18]

Contributor Author

Contributor

Where can I download the ERNIE-4.5-0.3B-PT-bf16 files used in this notebook?

Contributor Author

The files are identical to the ones here: https://huggingface.co/mlx-community/ERNIE-4.5-0.3B-PT-bf16

Contributor

Unfortunately, tokenizer.json still isn't generated.
I have added huggingface-cli download and pip install commands to your notebook.

Can you try running the notebook I provided?
I'd like to see whether your results differ when you run it with my notebook.
https://colab.research.google.com/drive/1fAHK6EL8JYsHDo5duJr5llI1YeeK974t?usp=sharing

[Screenshot: Google Chrome, 2025-07-09 00:25:40]

Contributor Author

OK, I have no idea what's going on here. I tried your notebook and got the same error as you. I also tried the same changes in my notebook and got the same error (not surprising). So the only approaches that work are:

  • downloading the model files via huggingface to a local directory and converting
  • uploading the model files to Colab and performing the conversion
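The first workaround, downloading the model files to a plain local directory before converting, could be sketched as follows. This assumes the `huggingface_hub` package is installed; the helper names `local_model_dir` and `download_model` and the `models/` root are hypothetical choices for illustration, not something taken from either notebook.

```python
from pathlib import Path


def local_model_dir(repo_id, root="models"):
    """Return the local directory for a repo id, e.g. 'org/name' -> models/name."""
    return Path(root) / repo_id.split("/")[-1]


def download_model(repo_id, root="models"):
    """Download every file of a Hub repository into a plain local directory."""
    # Imported lazily so this module still loads where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    target = local_model_dir(repo_id, root)
    # local_dir asks for real files in `target` rather than only the shared cache.
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Usage (requires network access):
#   model_dir = download_model("mlx-community/ERNIE-4.5-0.3B-PT-bf16")
#   ...then run the conversion against model_dir.
```

Keeping the files in an ordinary directory makes it straightforward to inspect what the conversion step reads and writes.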

@johnmai-dev
Contributor

@smdesai
Contributor Author

smdesai commented Jul 8, 2025

Thanks for tracking it down. It seems that when I was initially trying to convert, I used LlamaTokenizer, which reported an error but may have incorrectly modified tokenizer_config.json. I then switched to AutoTokenizer, which performed the conversion (incorrectly). I'm going to look at using tokenization_ernie4_5.py to generate tokenizer.json.
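To check whether an earlier conversion attempt silently rewrote tokenizer_config.json, one option is to compare a pristine copy against the local one. This is a minimal standard-library sketch; the function names, the `"<absent>"` marker, and the two directory paths in the usage comment are assumptions for illustration.

```python
import json

# Marker used when a key exists in only one of the two files.
ABSENT = "<absent>"


def load_json(path):
    """Read a JSON file as a dictionary."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def changed_keys(original, modified):
    """Return {key: (old, new)} for top-level keys whose values differ."""
    diff = {}
    for key in sorted(set(original) | set(modified)):
        old = original.get(key, ABSENT)
        new = modified.get(key, ABSENT)
        if old != new:
            diff[key] = (old, new)
    return diff


# Usage with hypothetical paths:
#   before = load_json("pristine/tokenizer_config.json")
#   after = load_json("models/ERNIE-4.5-0.3B-PT-bf16/tokenizer_config.json")
#   for key, (old, new) in changed_keys(before, after).items():
#       print(f"{key}: {old!r} -> {new!r}")
```

An empty result suggests the configuration was left untouched; any reported key, such as a changed tokenizer class, points at what the failed attempt altered.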

@smdesai
Contributor Author

smdesai commented Jul 8, 2025

@johnmai-dev Try this. It leaves the rest of the model files intact and creates only tokenizer.json. When running in Colab, use this for main(). I also made changes to Tokenizer.swift in MLXCommon to support the T5Tokenizer (Unigram) for Ernie.

def main():
    convert_sentencepiece_to_tokenizer_json("models/ERNIE-4.5-0.3B-PT-bf16/tokenizer.model", "models/ERNIE-4.5-0.3B-PT-bf16/tokenizer.json")

convert_tokenizer.py.txt
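After generating tokenizer.json, a quick sanity check is to read back the model type and vocabulary size. The sketch below assumes the file follows the Hugging Face tokenizers JSON layout, with the model type stored under `model.type` and the vocabulary under `model.vocab`; the function name `describe_tokenizer_json` is hypothetical and is not part of the attached script.

```python
import json


def describe_tokenizer_json(path):
    """Return (model type, vocabulary size) read from a tokenizer.json file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    model = data.get("model", {})
    # Unigram vocabularies are lists of [piece, score] pairs and BPE vocabularies
    # are token-to-id mappings; len() gives the entry count for both shapes.
    vocab = model.get("vocab", [])
    return model.get("type"), len(vocab)


# Usage, with the output path from main() above:
#   kind, size = describe_tokenizer_json("models/ERNIE-4.5-0.3B-PT-bf16/tokenizer.json")
#   print(kind, size)
```

For the Unigram tokenizer mentioned above, the reported type would be expected to be `Unigram`; a different value suggests the conversion took another path.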
