Commit 7830ed5

Bump version, sync codebase

1 parent 156eff9 commit 7830ed5

File tree

12 files changed: +175 −60 lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -2,6 +2,10 @@
 
 This is the changelog for the open source version of tiktoken.
 
+## [v0.2.0]
+- Add `tiktoken.encoding_for_model` to get the encoding for a specific model
+- Improve portability of caching logic
+
 ## [v0.1.2]
 - Avoid use of `blobfile` for public files
 - Add support for Python 3.8
```
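The headline addition is `tiktoken.encoding_for_model`, which maps an OpenAI model name to the encoding it uses. A minimal pure-Python sketch of such a lookup is below; the `MODEL_TO_ENCODING` table and the helper name are assumptions for illustration, and only the two entries asserted in this commit's `tests/test_simple_public.py` are grounded in the diff:

```python
# Hypothetical sketch of a model-name -> encoding-name lookup in the spirit of
# tiktoken.encoding_for_model. The table is illustrative; only the "gpt2" and
# "text-davinci-003" entries are confirmed by this commit's tests.
MODEL_TO_ENCODING = {
    "gpt2": "gpt2",
    "text-davinci-003": "p50k_base",
}


def encoding_name_for_model(model_name: str) -> str:
    """Return the encoding name for a model, with a clear error for unknown models."""
    try:
        return MODEL_TO_ENCODING[model_name]
    except KeyError:
        raise ValueError(f"Could not find encoding for model {model_name!r}") from None
```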

Cargo.toml

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 [package]
 name = "tiktoken"
-version = "0.1.0"
+version = "0.2.0"
 edition = "2021"
 rust-version = "1.57.0"
 
```

Makefile

Lines changed: 0 additions & 49 deletions
This file was deleted.

README.md

Lines changed: 75 additions & 1 deletion

````diff
@@ -7,6 +7,9 @@ OpenAI's models.
 import tiktoken
 enc = tiktoken.get_encoding("gpt2")
 assert enc.decode(enc.encode("hello world")) == "hello world"
+
+# To get the tokeniser corresponding to a specific model in the OpenAI API:
+enc = tiktoken.encoding_for_model("text-davinci-003")
 ```
 
 The open source version of `tiktoken` can be installed from PyPI:
@@ -16,7 +19,9 @@ pip install tiktoken
 
 The tokeniser API is documented in `tiktoken/core.py`.
 
-Example code using `tiktoken` can be found in the [OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
+Example code using `tiktoken` can be found in the
+[OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
+
 
 ## Performance
 
@@ -28,3 +33,72 @@ Performance measured on 1GB of text using the GPT-2 tokeniser, using `GPT2Tokeni
 `tokenizers==0.13.2` and `transformers==4.24.0`.
 
 
+## Getting help
+
+Please post questions in the [issue tracker](https://github.com/openai/tiktoken/issues).
+
+If you work at OpenAI, make sure to check the internal documentation or feel free to contact
+@shantanu.
+
+
+## Extending tiktoken
+
+You may wish to extend `tiktoken` to support new encodings. There are two ways to do this.
+
+
+**Create your `Encoding` object exactly the way you want and simply pass it around.**
+
+```python
+cl100k_base = tiktoken.get_encoding("cl100k_base")
+
+# In production, load the arguments directly instead of accessing private attributes
+# See openai_public.py for examples of arguments for specific encodings
+enc = tiktoken.Encoding(
+    # If you're changing the set of special tokens, make sure to use a different name.
+    # It should be clear from the name what behaviour to expect.
+    name="cl100k_im",
+    pat_str=cl100k_base._pat_str,
+    mergeable_ranks=cl100k_base._mergeable_ranks,
+    special_tokens={
+        **cl100k_base._special_tokens,
+        "<|im_start|>": 100264,
+        "<|im_end|>": 100265,
+    }
+)
+```
+
+**Use the `tiktoken_ext` plugin mechanism to register your `Encoding` objects with `tiktoken`.**
+
+This is only useful if you need `tiktoken.get_encoding` to find your encoding; otherwise prefer
+option 1.
+
+To do this, you'll need to create a namespace package under `tiktoken_ext`.
+
+Lay out your project like this, making sure to omit the `tiktoken_ext/__init__.py` file:
+```
+my_tiktoken_extension
+├── tiktoken_ext
+│   └── my_encodings.py
+└── setup.py
+```
+
+`my_encodings.py` should be a module that contains a variable named `ENCODING_CONSTRUCTORS`.
+This is a dictionary from an encoding name to a function that takes no arguments and returns
+arguments that can be passed to `tiktoken.Encoding` to construct that encoding. For an example, see
+`tiktoken_ext/openai_public.py`. For precise details, see `tiktoken/registry.py`.
+
+Your `setup.py` should look something like this:
+```python
+from setuptools import setup, find_namespace_packages
+
+setup(
+    name="my_tiktoken_extension",
+    packages=find_namespace_packages(include=["tiktoken_ext.*"]),
+    install_requires=["tiktoken"],
+    ...
+)
+```
+
+Then simply `pip install my_tiktoken_extension` and you should be able to use your custom encodings!
+Make sure **not** to use an editable install.
````
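The README's plugin mechanism boils down to one module-level dict. A hedged sketch of what a `tiktoken_ext/my_encodings.py` might contain, where the encoding name, pattern string, and ranks are made up for illustration (see `tiktoken_ext/openai_public.py` for the real conventions):

```python
# Sketch of a hypothetical tiktoken_ext/my_encodings.py. All values below are
# illustrative assumptions; tiktoken/registry.py consumes ENCODING_CONSTRUCTORS
# by calling each constructor and passing the result to tiktoken.Encoding(**kwargs).
def my_encoding():
    # A zero-argument constructor returning keyword arguments for tiktoken.Encoding.
    return {
        "name": "my_encoding",
        "pat_str": r"\S+|\s+",  # split text into runs of non-whitespace / whitespace
        "mergeable_ranks": {bytes([i]): i for i in range(256)},  # byte-level only
        "special_tokens": {"<|endoftext|>": 256},
    }


ENCODING_CONSTRUCTORS = {
    "my_encoding": my_encoding,
}
```

With this installed (as a regular, non-editable package), `tiktoken.get_encoding("my_encoding")` would discover the constructor through the `tiktoken_ext` namespace package.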

pyproject.toml

Lines changed: 2 additions & 2 deletions

```diff
@@ -1,12 +1,12 @@
 [project]
 name = "tiktoken"
+version = "0.2.0"
 dependencies = ["blobfile>=2", "regex>=2022.1.18", "requests>=2.26.0"]
-dynamic = ["version"]
 requires-python = ">=3.8"
 
 [build-system]
 build-backend = "setuptools.build_meta"
-requires = ["setuptools>=61", "wheel", "setuptools-rust>=1.3"]
+requires = ["setuptools>=62.4", "wheel", "setuptools-rust>=1.5.2"]
 
 [tool.cibuildwheel]
 build-frontend = "build"
```

setup.py

Lines changed: 0 additions & 6 deletions

```diff
@@ -1,14 +1,8 @@
 from setuptools import setup
 from setuptools_rust import Binding, RustExtension
 
-public = True
-
-if public:
-    version = "0.1.2"
-
 setup(
     name="tiktoken",
-    version=version,
     rust_extensions=[
         RustExtension(
             "tiktoken._tiktoken",
```

tests/test_simple_public.py

Lines changed: 7 additions & 0 deletions

```diff
@@ -17,3 +17,10 @@ def test_simple():
         enc = tiktoken.get_encoding(enc_name)
         for token in range(10_000):
             assert enc.encode_single_token(enc.decode_single_token_bytes(token)) == token
+
+
+def test_encoding_for_model():
+    enc = tiktoken.encoding_for_model("gpt2")
+    assert enc.name == "gpt2"
+    enc = tiktoken.encoding_for_model("text-davinci-003")
+    assert enc.name == "p50k_base"
```

tiktoken/__init__.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -1,3 +1,4 @@
 from .core import Encoding as Encoding
+from .model import encoding_for_model as encoding_for_model
 from .registry import get_encoding as get_encoding
 from .registry import list_encoding_names as list_encoding_names
```

tiktoken/core.py

Lines changed: 15 additions & 0 deletions

```diff
@@ -19,6 +19,21 @@ def __init__(
         special_tokens: dict[str, int],
         explicit_n_vocab: Optional[int] = None,
     ):
+        """Creates an Encoding object.
+
+        See openai_public.py for examples of how to construct an Encoding object.
+
+        Args:
+            name: The name of the encoding. It should be clear from the name of the encoding
+                what behaviour to expect; in particular, encodings with different special tokens
+                should have different names.
+            pat_str: A regex pattern string that is used to split the input text.
+            mergeable_ranks: A dictionary mapping mergeable token bytes to their ranks. The ranks
+                must correspond to merge priority.
+            special_tokens: A dictionary mapping special token strings to their token values.
+            explicit_n_vocab: The number of tokens in the vocabulary. If provided, it is checked
+                that the number of mergeable tokens and special tokens is equal to this number.
+        """
         self.name = name
 
         self._pat_str = pat_str
```
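The docstring's note that ranks "must correspond to merge priority" is the heart of byte pair encoding. A toy pure-Python sketch of how merge priority drives encoding (this is an illustration only, not tiktoken's actual Rust implementation, which is far more optimised):

```python
def bpe_encode(mergeable_ranks: dict, text: bytes) -> list:
    """Toy BPE: start from single bytes, repeatedly merge the adjacent pair
    whose merged bytes have the lowest rank (i.e. highest merge priority)."""
    parts = [bytes([b]) for b in text]
    while True:
        best = None  # (rank, index) of the best merge found this pass
        for i in range(len(parts) - 1):
            rank = mergeable_ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no adjacent pair is mergeable: tokenisation is final
        _, i = best
        parts = parts[: i] + [parts[i] + parts[i + 1]] + parts[i + 2 :]
    return [mergeable_ranks[p] for p in parts]


# Illustrative ranks: lower rank = earlier merge.
ranks = {b"h": 0, b"e": 1, b"l": 2, b"o": 3,
         b"he": 4, b"ll": 5, b"hell": 6, b"hello": 7}
print(bpe_encode(ranks, b"hello"))  # [7]: merges he, ll, hell, then hello
```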

tiktoken/load.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -4,6 +4,7 @@
 import hashlib
 import json
 import os
+import tempfile
 import uuid
 
 import blobfile
@@ -24,7 +25,7 @@ def read_file_cached(blobpath: str) -> bytes:
     elif "DATA_GYM_CACHE_DIR" in os.environ:
         cache_dir = os.environ["DATA_GYM_CACHE_DIR"]
     else:
-        cache_dir = "/tmp/data-gym-cache"
+        cache_dir = os.path.join(tempfile.gettempdir(), "data-gym-cache")
 
     if cache_dir == "":
         # disable caching
```
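This is the "improve portability of caching logic" change from the changelog: `tempfile.gettempdir()` honours `TMPDIR`/`TEMP`/`TMP` and works on Windows, unlike the hard-coded `/tmp`. The fallback chain can be sketched as below; note the diff only shows the tail of the real function, so the first branch of `read_file_cached`'s resolution logic is elided here:

```python
import os
import tempfile


def resolve_cache_dir(environ: dict) -> str:
    # Sketch of the cache-directory fallback visible in this diff of
    # tiktoken/load.py (the real function has an earlier branch not shown).
    if "DATA_GYM_CACHE_DIR" in environ:
        return environ["DATA_GYM_CACHE_DIR"]
    # Portable default: tempfile.gettempdir() instead of a hard-coded "/tmp".
    return os.path.join(tempfile.gettempdir(), "data-gym-cache")
```

Returning `""` from the environment variable still disables caching, since the caller checks for an empty string.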
