A playground for understanding BPE (Byte-Pair Encoding) tokenization
- https://en.wikipedia.org/wiki/Byte-pair_encoding
- https://huggingface.co/learn/llm-course/chapter6/5
- The tokenizers are based on Andrej Karpathy's video "Let's build the GPT Tokenizer": https://www.youtube.com/watch?v=zduSFxRajkE&t=350s
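
For orientation, here is a minimal sketch of byte-level BPE training in the spirit of the references above. It is not the code in scripts/; the function names (get_pair_counts, merge, train_bpe, decode) are illustrative assumptions, and new token ids are assumed to start at 256, right after the raw byte values.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in merge order
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        # Most frequent pair; ties go to the pair seen first.
        pair = max(counts, key=counts.get)
        new_id = 256 + step  # assumption: ids 0..255 are the raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Expand token ids back into text by replaying the merges."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids)                  # [258, 100, 258, 97, 99]
print(decode(ids, merges))  # aaabdaaabac
```

The input string is the example from the Wikipedia article: three merges shrink the 11 input bytes to 5 tokens, and decoding recovers the original text.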
-
Create a virtual environment:
python -m venv venv
-
Activate the virtual environment:
source venv/bin/activate
-
Install Python dependencies:
pip install -r requirements.txt
-
Run the tokenizer tester script:
python scripts/tokenizer_tester.py
-
Run the Streamlit tokenizer viewer:
streamlit run scripts/tokenizer_viewer.py
Then go to http://localhost:8501 in your browser.