Status: Open
Labels: bug (Something isn't working)
Description
Describe the issue as clearly as possible:
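The vocabulary of the ERNIE tokenizer (`baidu/ERNIE-4.5-0.3B-PT`) cannot be reduced: `reduced_vocabulary` raises a `RuntimeError` on the token `�@` (id 36865).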
Steps/code to reproduce the bug:
```
from outlines_core.fsm.regex import reduced_vocabulary
from outlines.models.transformers import TransformerTokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baidu/ERNIE-4.5-0.3B-PT")
vocabulary = reduced_vocabulary(TransformerTokenizer(tokenizer))
```
Expected result:
No error; `reduced_vocabulary` should return normally.
Error message:
```
Traceback (most recent call last):
  File "/Users/neil/workspace/amphibian-apps/apps/mlx-engine/../../outlines_test.py", line 6, in <module>
    vocabulary = reduced_vocabulary(TransformerTokenizer(tokenizer))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/workspace/amphibian-apps/apps/mlx-engine/.venv/lib/python3.11/site-packages/outlines_core/fsm/regex.py", line 426, in reduced_vocabulary
    raise RuntimeError(
RuntimeError: Cannot convert token `�@` (36865) to bytes: �@
```
Outlines/Python version information:
Version information
```
python -c "from outlines import _version; print(_version.version)"
1.1.0
python -c "import sys; print('Python', sys.version)"
Python 3.11.11 (main, Dec 3 2024, 17:20:40) [Clang 16.0.0 (clang-1600.0.26.4)]
uv pip freeze
addict==2.4.0
aiofiles==24.1.0
aiohappyeyeballs==2.6.1
aiohttp==3.11.18
aioice==0.10.1
aiortc==1.13.0
aiosignal==1.3.2
airportsdata==20250622
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.9.0
attrs==25.3.0
audioread==3.0.1
av==14.4.0
babel==2.17.0
blis==1.3.0
brotli==1.1.0
catalogue==2.0.10
certifi==2025.6.15
cffi==1.17.1
charset-normalizer==3.4.2
click==8.2.1
cloudpathlib==0.21.1
cloudpickle==3.1.1
colorama==0.4.6
coloredlogs==15.0.1
confection==0.1.5
cryptography==45.0.5
csvw==3.5.1
curated-tokenizers==0.0.9
curated-transformers==0.1.1
cymem==2.0.11
dacite==1.9.2
datasets==4.0.0
decorator==5.2.1
dill==0.3.8
diskcache==5.6.3
dlinfo==2.0.0
dnspython==2.7.0
docopt==0.6.2
einops==0.8.1
einx==0.3.0
espeakng-loader==0.2.4
fastapi==0.115.14
fastrtc==0.0.29
fastrtc-moonshine-onnx==20241016
ffmpy==0.6.0
filelock==3.18.0
flatbuffers==25.2.10
frozendict==2.4.6
frozenlist==1.6.0
fsspec==2024.12.0
future==1.0.0
genson==1.3.0
google-crc32c==1.7.1
gradio==5.38.0
gradio-client==1.11.0
groovy==0.1.2
h11==0.16.0
hf-xet==1.1.5
httpcore==1.0.9
httpx==0.28.1
huggingface-hub==0.33.1
humanfriendly==10.0
idna==3.10
ifaddr==0.2.0
iniconfig==2.1.0
interegular==0.3.3
iso3166==2.1.1
isodate==0.7.2
jinja2==3.1.6
joblib==1.5.1
jsonpath-ng==1.7.0
jsonschema==4.24.0
jsonschema-specifications==2025.4.1
langcodes==3.5.0
language-data==1.3.0
language-tags==1.2.0
lark==1.2.2
lazy-loader==0.4
librosa==0.11.0
llvmlite==0.44.0
loguru==0.7.3
marisa-trie==1.2.1
markdown-it-py==3.0.0
markupsafe==2.1.5
mdurl==0.1.2
misaki==0.9.4
mlx==0.26.3
mlx-audio==0.2.3
mlx-lm==0.26.0
mlx-vlm @ git+https://github.com/neilmehta24/mlx-vlm.git@73523d6538ef31ee13d62ce5391e67d8754a93e8
mpmath==1.3.0
msgpack==1.1.1
multidict==6.4.3
multiprocess==0.70.16
murmurhash==1.0.13
nest-asyncio==1.6.0
networkx==3.4.2
num2words==0.5.14
numba==0.61.2
numpy==2.1.3
omegaconf==2.3.0
onnxruntime==1.22.1
opencv-python==4.10.0.84
orjson==3.11.0
outlines==1.1.0
outlines-core==0.1.26
packaging==25.0
pandas==2.3.1
phonemizer-fork==3.3.2
pillow==11.2.1
platformdirs==4.3.8
pluggy==1.6.0
ply==3.11
pooch==1.8.2
preshed==3.0.10
propcache==0.3.1
protobuf==6.31.1
pyarrow==21.0.0
pycparser==2.22
pydantic==2.11.7
pydantic-core==2.33.2
pydub==0.25.1
pyee==13.0.0
pygments==2.19.2
pylibsrtp==0.12.0
pyloudnorm==0.1.1
pyopenssl==25.1.0
pyparsing==3.2.3
pytest==8.4.1
python-dateutil==2.9.0.post0
python-multipart==0.0.20
pytz==2025.2
pyyaml==6.0.2
rdflib==7.1.4
referencing==0.36.2
regex==2024.11.6
requests==2.32.4
rfc3986==1.5.0
rich==14.0.0
rpds-py==0.25.1
ruff==0.12.4
safehttpx==0.1.6
safetensors==0.5.3
scikit-learn==1.7.1
scipy==1.16.0
segments==2.3.0
semantic-version==2.10.0
sentencepiece==0.2.0
setuptools==80.0.0
shellingham==1.5.4
six==1.17.0
smart-open==7.3.0.post1
sniffio==1.3.1
sounddevice==0.5.2
soundfile==0.13.1
soxr==0.5.0.post1
spacy==3.8.7
spacy-curated-transformers==0.3.1
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.5.1
starlette==0.46.2
sympy==1.14.0
thinc==8.3.6
threadpoolctl==3.6.0
tiktoken==0.9.0
timm==1.0.16
tokenizers==0.21.2
tomlkit==0.13.3
torch==2.7.0
torchvision==0.22.0
tqdm==4.67.1
transformers==4.53.0
typer==0.16.0
typing-extensions==4.13.2
typing-inspection==0.4.1
tzdata==2025.2
uritemplate==4.2.0
urllib3==2.5.0
uvicorn==0.35.0
wasabi==1.1.3
weasel==0.4.1
webrtcvad==2.0.10
websockets==15.0.1
wrapt==1.17.2
xxhash==3.5.0
yarl==1.20.0
```
Context for the issue:
I came across this issue when attempting to use Outlines with an ERNIE 4.5 model.
This is the temporary fix I came up with:
outlines_core.fsm.regex.re_replacement_seq = re.compile(r"^▁*\.*>*�+\.*s*@*(�@)*$")
Hopefully a proper fix can be made that handles occurrences of this character in tokenizer vocabularies more robustly.
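For anyone who wants to try the workaround, here is a minimal standalone sketch. It makes two assumptions. First, that `reduced_vocabulary` only attempts the failing byte conversion for decoded token strings that contain U+FFFD (`�`) and do not match `re_replacement_seq`. Second, that reassigning that module attribute is enough for the change to take effect. The names `PATCHED_REPLACEMENT_SEQ`, `unmatched_replacement_tokens`, `apply_workaround`, and `toy_vocab` are hypothetical and not part of outlines-core; only the pattern itself and the `�@` token id come from this issue.

```python
import re

# Pattern copied verbatim from the workaround in this issue.
PATCHED_REPLACEMENT_SEQ = re.compile(r"^▁*\.*>*�+\.*s*@*(�@)*$")

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, rendered as "�"


def unmatched_replacement_tokens(vocab, pattern=PATCHED_REPLACEMENT_SEQ):
    """Return tokens that contain U+FFFD but are not matched by `pattern`.

    Under the assumption stated above, these are the tokens that could
    still reach the byte conversion that raises the RuntimeError.
    """
    return sorted(
        token
        for token in vocab
        if REPLACEMENT_CHAR in token and not pattern.match(token)
    )


def apply_workaround():
    """Monkeypatch outlines-core as described in this issue.

    Requires outlines-core to be installed; the import is deferred so the
    rest of this sketch runs without it.
    """
    import outlines_core.fsm.regex as oc_regex

    oc_regex.re_replacement_seq = PATCHED_REPLACEMENT_SEQ


# Toy vocabulary: hypothetical, except for the "�@" token and its id,
# which are taken from the traceback.
toy_vocab = {"hello": 0, "▁�": 1, "�@": 36865, "�x": 2}

# "�@" is accepted by the patched pattern, "�x" is not.
print(unmatched_replacement_tokens(toy_vocab))
```

Running the helper over the decoded strings of a real tokenizer vocabulary should show whether any other U+FFFD tokens would still fall outside the patched pattern; call `apply_workaround()` before `reduced_vocabulary(...)` to apply it.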