1attila/Tokenization

Tokenization

A small package with BPE tokenization utilities for LLMs, inspired by Andrej Karpathy's minbpe repository.

Utilities

Regex tokenizer

  • GPT-4 split pattern
  • Encoding/decoding methods with a custom special-tokens dict
  • Trainable, including from an existing tokenizer (useful if you want to train on multiple corpora)
  • Optimizable to eliminate unused tokens/merges
  • Maps used tokens to reduce LLM vocab size
  • Supports saving to and loading from a folder
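
To make the training and encoding features above concrete, here is a minimal byte-level BPE sketch in the spirit of minbpe. It is not this package's API: the function names (`pair_counts`, `merge`, `train`, `encode`, `decode`) are hypothetical, and the sketch skips the regex split step and special tokens entirely. The optional `merges` argument of `train` illustrates how training might continue from an existing tokenizer on a second corpus.

```python
from collections import Counter

def pair_counts(ids):
    """Count adjacent token-id pairs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def encode(text, merges):
    """UTF-8 bytes, then apply merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Rebuild the byte string for each token id and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

def train(text, num_merges, merges=None):
    """Learn up to `num_merges` new merges, optionally continuing from `merges`."""
    merges = dict(merges or {})
    ids = encode(text, merges)  # start from what the existing tokenizer produces
    for _ in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + len(merges)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges

# Train on one corpus, then continue on a second one.
base = train("aaabdaaabac", 3)
extended = train("abab abab", 2, merges=base)
assert decode(encode("aaabdaaabac", extended), extended) == "aaabdaaabac"
```

Continuing from `base` keeps the earlier merges and their ids unchanged, so text encoded with the first tokenizer stays decodable with the extended one.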

Utilities

  • Count number of tokens
  • Check if special tokens are present
  • Train/optimize/build the vocab on multiple corpora
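
As a rough illustration of what such utilities could look like, the sketch below counts tokens over several corpora, checks which special tokens occur in a text, and builds a dense mapping of the token ids actually used (the idea behind shrinking an LLM's vocab). All names here (`count_tokens`, `find_special_tokens`, `used_token_map`, `byte_encode`) are hypothetical and not taken from the package; `byte_encode` is a stand-in for a real tokenizer's encode method, and the special-token id is arbitrary.

```python
def count_tokens(corpora, encode):
    """Total number of tokens across an iterable of texts."""
    return sum(len(encode(text)) for text in corpora)

def find_special_tokens(text, special_tokens):
    """Return the special-token strings that occur in `text`."""
    return [tok for tok in special_tokens if tok in text]

def used_token_map(corpora, encode):
    """Map every token id seen in the corpora to a dense id 0..n-1."""
    used = set()
    for text in corpora:
        used.update(encode(text))
    return {old: new for new, old in enumerate(sorted(used))}

def byte_encode(text):
    """Stand-in encoder for illustration: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

corpora = ["hello", "help"]
special = {"<|endoftext|>": 50000}  # hypothetical special-tokens dict

total = count_tokens(corpora, byte_encode)
present = find_special_tokens("first doc<|endoftext|>second doc", special)
mapping = used_token_map(corpora, byte_encode)
```

With a mapping like this, a model only needs embedding rows for the ids in `mapping.values()`, at the cost of keeping the mapping alongside the tokenizer.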

Todo

  • Rewrite everything in C++ to make it faster (in progress, but it takes time)
  • Add other utilities (like .vocab file)

About

A small package with BPE tokenization utilities for LLMs, inspired by Andrej Karpathy's minbpe repository (https://github.com/karpathy/minbpe)
