A C# implementation for OpenAI language models encoding
-
It seems that the BPE file contains a
Mapofbase64strings tointegerpairs. -
The regex doesn't seem to translate well. There is something that breaks, when trying to encode something directly in rust (see in the fork this specific test)
-
Currently the regex is not separating the words nor is it encoding them base64 individually, and that seems to be the problem
- Focus on the encoders' regex for
tiktoken. - Write the regex correctly in a C# compliant way.
- Test with an external script. Meaning, table test different inputs and assert over the output of
tiktokenvsHaToken
-
Python debugging is easy from VSCode. Just remember to edit the
launch.jsonfile so that"justMyCode": falsethen you will be able to dive into your localpipinstallation oftiktoken -
In order to do that either press F11 a lot, or open the files in VSCode for
tiktokenand set your breakpoints there. Their locations should be something like~/.local/lib/python3.8/site-packages/tiktoken/core.py -
You will have to install
tiktokenwithpiplike so:
pip3 install tiktoken- Example
playground.pyfile:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
result = enc.encode("Hello world")
print(result)