Failure to reduce Ernie 4.5 vocabulary #222

@neilmehta24

Describe the issue as clearly as possible:

Calling `reduced_vocabulary` on the Ernie 4.5 tokenizer raises a `RuntimeError`, so the tokenizer's vocabulary cannot be reduced.

Steps/code to reproduce the bug:

from outlines_core.fsm.regex import reduced_vocabulary
from outlines.models.transformers import TransformerTokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baidu/ERNIE-4.5-0.3B-PT")
vocabulary = reduced_vocabulary(TransformerTokenizer(tokenizer))

Expected result:

No error expected

Error message:

Traceback (most recent call last):
  File "/Users/neil/workspace/amphibian-apps/apps/mlx-engine/../../outlines_test.py", line 6, in <module>
    vocabulary = reduced_vocabulary(TransformerTokenizer(tokenizer))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/workspace/amphibian-apps/apps/mlx-engine/.venv/lib/python3.11/site-packages/outlines_core/fsm/regex.py", line 426, in reduced_vocabulary
    raise RuntimeError(
RuntimeError: Cannot convert token `�@` (36865) to bytes: �@

Outlines/Python version information:

```
python -c "from outlines import _version; print(_version.version)"
1.1.0
python -c "import sys; print('Python', sys.version)"
Python 3.11.11 (main, Dec 3 2024, 17:20:40) [Clang 16.0.0 (clang-1600.0.26.4)]
uv pip freeze
addict==2.4.0
aiofiles==24.1.0
aiohappyeyeballs==2.6.1
aiohttp==3.11.18
aioice==0.10.1
aiortc==1.13.0
aiosignal==1.3.2
airportsdata==20250622
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.9.0
attrs==25.3.0
audioread==3.0.1
av==14.4.0
babel==2.17.0
blis==1.3.0
brotli==1.1.0
catalogue==2.0.10
certifi==2025.6.15
cffi==1.17.1
charset-normalizer==3.4.2
click==8.2.1
cloudpathlib==0.21.1
cloudpickle==3.1.1
colorama==0.4.6
coloredlogs==15.0.1
confection==0.1.5
cryptography==45.0.5
csvw==3.5.1
curated-tokenizers==0.0.9
curated-transformers==0.1.1
cymem==2.0.11
dacite==1.9.2
datasets==4.0.0
decorator==5.2.1
dill==0.3.8
diskcache==5.6.3
dlinfo==2.0.0
dnspython==2.7.0
docopt==0.6.2
einops==0.8.1
einx==0.3.0
espeakng-loader==0.2.4
fastapi==0.115.14
fastrtc==0.0.29
fastrtc-moonshine-onnx==20241016
ffmpy==0.6.0
filelock==3.18.0
flatbuffers==25.2.10
frozendict==2.4.6
frozenlist==1.6.0
fsspec==2024.12.0
future==1.0.0
genson==1.3.0
google-crc32c==1.7.1
gradio==5.38.0
gradio-client==1.11.0
groovy==0.1.2
h11==0.16.0
hf-xet==1.1.5
httpcore==1.0.9
httpx==0.28.1
huggingface-hub==0.33.1
humanfriendly==10.0
idna==3.10
ifaddr==0.2.0
iniconfig==2.1.0
interegular==0.3.3
iso3166==2.1.1
isodate==0.7.2
jinja2==3.1.6
joblib==1.5.1
jsonpath-ng==1.7.0
jsonschema==4.24.0
jsonschema-specifications==2025.4.1
langcodes==3.5.0
language-data==1.3.0
language-tags==1.2.0
lark==1.2.2
lazy-loader==0.4
librosa==0.11.0
llvmlite==0.44.0
loguru==0.7.3
marisa-trie==1.2.1
markdown-it-py==3.0.0
markupsafe==2.1.5
mdurl==0.1.2
misaki==0.9.4
mlx==0.26.3
mlx-audio==0.2.3
mlx-lm==0.26.0
mlx-vlm @ git+https://github.com/neilmehta24/mlx-vlm.git@73523d6538ef31ee13d62ce5391e67d8754a93e8
mpmath==1.3.0
msgpack==1.1.1
multidict==6.4.3
multiprocess==0.70.16
murmurhash==1.0.13
nest-asyncio==1.6.0
networkx==3.4.2
num2words==0.5.14
numba==0.61.2
numpy==2.1.3
omegaconf==2.3.0
onnxruntime==1.22.1
opencv-python==4.10.0.84
orjson==3.11.0
outlines==1.1.0
outlines-core==0.1.26
packaging==25.0
pandas==2.3.1
phonemizer-fork==3.3.2
pillow==11.2.1
platformdirs==4.3.8
pluggy==1.6.0
ply==3.11
pooch==1.8.2
preshed==3.0.10
propcache==0.3.1
protobuf==6.31.1
pyarrow==21.0.0
pycparser==2.22
pydantic==2.11.7
pydantic-core==2.33.2
pydub==0.25.1
pyee==13.0.0
pygments==2.19.2
pylibsrtp==0.12.0
pyloudnorm==0.1.1
pyopenssl==25.1.0
pyparsing==3.2.3
pytest==8.4.1
python-dateutil==2.9.0.post0
python-multipart==0.0.20
pytz==2025.2
pyyaml==6.0.2
rdflib==7.1.4
referencing==0.36.2
regex==2024.11.6
requests==2.32.4
rfc3986==1.5.0
rich==14.0.0
rpds-py==0.25.1
ruff==0.12.4
safehttpx==0.1.6
safetensors==0.5.3
scikit-learn==1.7.1
scipy==1.16.0
segments==2.3.0
semantic-version==2.10.0
sentencepiece==0.2.0
setuptools==80.0.0
shellingham==1.5.4
six==1.17.0
smart-open==7.3.0.post1
sniffio==1.3.1
sounddevice==0.5.2
soundfile==0.13.1
soxr==0.5.0.post1
spacy==3.8.7
spacy-curated-transformers==0.3.1
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.5.1
starlette==0.46.2
sympy==1.14.0
thinc==8.3.6
threadpoolctl==3.6.0
tiktoken==0.9.0
timm==1.0.16
tokenizers==0.21.2
tomlkit==0.13.3
torch==2.7.0
torchvision==0.22.0
tqdm==4.67.1
transformers==4.53.0
typer==0.16.0
typing-extensions==4.13.2
typing-inspection==0.4.1
tzdata==2025.2
uritemplate==4.2.0
urllib3==2.5.0
uvicorn==0.35.0
wasabi==1.1.3
weasel==0.4.1
webrtcvad==2.0.10
websockets==15.0.1
wrapt==1.17.2
xxhash==3.5.0
yarl==1.20.0
```

Context for the issue:

I came across the issue when attempting to use outlines with an Ernie 4.5 model.

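This is the temporary fix I came up with (it overrides the module-level pattern before `reduced_vocabulary` is called):
import re
import outlines_core.fsm.regex
outlines_core.fsm.regex.re_replacement_seq = re.compile(r"^▁*\.*>*�+\.*s*@*(�@)*$")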
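
To illustrate why the override helps, here is a minimal sketch using only the standard `re` module. The `patched` pattern is the one from the workaround above. The `baseline` pattern is an assumption: it is simply the workaround with its trailing `@*(�@)*` removed, and has not been checked against the `outlines_core` source. The token string is the one shown in the error message.

```python
import re

# Pattern taken verbatim from the workaround above.
patched = re.compile(r"^▁*\.*>*�+\.*s*@*(�@)*$")

# Hypothetical baseline: the workaround minus its added `@*(�@)*` suffix.
# This is an assumption for illustration, not the confirmed library pattern.
baseline = re.compile(r"^▁*\.*>*�+\.*s*$")

# Token 36865 from the error message: U+FFFD followed by "@".
token = "\ufffd@"

baseline_matches = baseline.match(token) is not None
patched_matches = patched.match(token) is not None

print("baseline matches:", baseline_matches)
print("patched matches:", patched_matches)
```

Under that assumption, the trailing `@` is what keeps the token from matching the baseline pattern, and the extended pattern accepts it.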

Hopefully a more general fix can be made that better handles occurrences of this character in tokenizers.
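
As a rough way to gauge how many tokens are affected, one could scan the decoded vocabulary strings for entries that contain U+FFFD but are not covered by a given pattern. The sketch below is only an illustration: the helper name `unmatched_replacement_tokens`, the toy vocabulary, and the deliberately simple pattern are all hypothetical, and none of it is taken from `outlines_core`.

```python
import re

REPLACEMENT_CHAR = "\ufffd"


def unmatched_replacement_tokens(vocab, pattern):
    """Return (token, id) pairs that contain U+FFFD but do not match `pattern`.

    `vocab` maps decoded token strings to token ids (hypothetical input shape).
    Results are sorted by token id.
    """
    suspects = [
        (token, token_id)
        for token, token_id in vocab.items()
        if REPLACEMENT_CHAR in token and pattern.match(token) is None
    ]
    return sorted(suspects, key=lambda pair: pair[1])


# Toy vocabulary for illustration only; real ids and strings would come
# from the tokenizer under test.
toy_vocab = {"hello": 0, "\ufffd": 1, "\ufffd\ufffd": 3, "\ufffd@": 36865}

# Deliberately simple, hypothetical pattern: one or more U+FFFD and nothing else.
simple_pattern = re.compile(r"^\ufffd+$")

suspects = unmatched_replacement_tokens(toy_vocab, simple_pattern)
print(suspects)
```

Running such a scan against the real decoded vocabulary would show whether `�@` is an isolated case or one of several shapes a general fix needs to cover.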

Labels: bug