Commit 1f098ca

Build wheels; update codebase
1 parent a1a9f16 commit 1f098ca

9 files changed: +122 -4 lines changed

.github/workflows/build_wheels.yml

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+name: Build wheels
+
+on: [push, pull_request, workflow_dispatch]
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
+  cancel-in-progress: true
+
+jobs:
+  build_wheels:
+    name: py${{ matrix.python-version }} on ${{ matrix.os }}
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        # cibuildwheel builds linux wheels inside a manylinux container
+        # it also takes care of procuring the correct python version for us
+        os: [ubuntu-latest, windows-latest, macos-latest]
+        python-version: [39, 310, 311]
+
+    steps:
+      - uses: actions/checkout@v3
+
+      - uses: pypa/cibuildwheel@v2.11.3
+        env:
+          CIBW_BUILD: "cp${{ matrix.python-version}}-*"
+
+      - uses: actions/upload-artifact@v3
+        with:
+          name: dist
+          path: ./wheelhouse/*.whl
+
+  build_sdist:
+    name: sdist
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - uses: actions/setup-python@v4
+        name: Install Python
+        with:
+          python-version: "3.9"
+      - name: Run check-manifest
+        run: |
+          pip install check-manifest
+          check-manifest -v
+      - name: Build sdist
+        run: |
+          pip install --upgrade build
+          python -m build --sdist
+      - uses: actions/upload-artifact@v3
+        with:
+          name: dist
+          path: ./dist/*.tar.gz
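
As a rough illustration of what the `strategy.matrix` in this workflow expands to, the sketch below enumerates the job combinations in Python. It only mirrors the `name` and `CIBW_BUILD` templates from the workflow; the `expand_matrix` helper and the dict keys are hypothetical and are not part of GitHub Actions or cibuildwheel.

```python
from itertools import product

# Values copied from the workflow's strategy.matrix.
OS_LIST = ["ubuntu-latest", "windows-latest", "macos-latest"]
PYTHON_VERSIONS = [39, 310, 311]


def expand_matrix(os_list, python_versions):
    """Hypothetical helper: build one dict per (os, python-version) job."""
    jobs = []
    for os_name, py in product(os_list, python_versions):
        jobs.append(
            {
                # Mirrors: name: py${{ matrix.python-version }} on ${{ matrix.os }}
                "name": f"py{py} on {os_name}",
                "runs_on": os_name,
                # Mirrors: CIBW_BUILD: "cp${{ matrix.python-version}}-*"
                "CIBW_BUILD": f"cp{py}-*",
            }
        )
    return jobs


if __name__ == "__main__":
    for job in expand_matrix(OS_LIST, PYTHON_VERSIONS):
        print(job["name"], "->", job["CIBW_BUILD"])
```

Three operating systems times three Python versions gives nine `build_wheels` jobs, and because `fail-fast` is `false`, a failure in one of them should not cancel the others.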

MANIFEST.in

Lines changed: 1 addition & 0 deletions
@@ -2,4 +2,5 @@ include *.svg
 include *.toml
 include Makefile
 recursive-include scripts *.py
+recursive-include tests *.py
 recursive-include src *.rs

README.md

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ The tokeniser API is documented in `tiktoken/core.py`.
 
 ## Performance
 
-`tiktoken` is between 3-6x faster than huggingface's tokeniser:
+`tiktoken` is between 3-6x faster than a comparable open source tokeniser:
 
 ![image](./perf.svg)
 

perf.svg

Lines changed: 1 addition & 0 deletions

pyproject.toml

Lines changed: 24 additions & 1 deletion
@@ -2,7 +2,30 @@
 name = "tiktoken"
 dependencies = ["blobfile>=2", "regex>=2022.1.18"]
 dynamic = ["version"]
+requires-python = ">=3.9"
 
 [build-system]
-requires = ["setuptools", "wheel", "setuptools-rust"]
+build-backend = "setuptools.build_meta"
+requires = ["setuptools>=61", "wheel", "setuptools-rust>=1.3"]
+
+[tool.cibuildwheel]
+build-frontend = "build"
+build-verbosity = 1
+
+linux.before-all = "curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y"
+linux.environment = { PATH = "$PATH:$HOME/.cargo/bin" }
+macos.before-all = "rustup target add aarch64-apple-darwin"
+
+skip = [
+    "*-manylinux_i686",
+    "*-musllinux_i686",
+    "*-win32",
+]
+macos.archs = ["x86_64", "arm64"]
+# When cross-compiling on Intel, it is not possible to test arm64 wheels.
+# Warnings will be silenced with following CIBW_TEST_SKIP
+test-skip = "*-macosx_arm64"
+
+before-test = "pip install pytest"
+test-command = "pytest {project}/tests"
 
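
To make the interaction of the `CIBW_BUILD` selector with the `skip` and `test-skip` settings more concrete, here is a minimal sketch. It assumes the selectors behave like shell-style globs as implemented by Python's `fnmatch`. The `plan` and `matches_any` helpers and the `SAMPLE` list are hypothetical, and cibuildwheel's real selection also depends on the host platform and the configured architectures.

```python
from fnmatch import fnmatchcase

# Patterns copied from [tool.cibuildwheel] in pyproject.toml.
SKIP = ["*-manylinux_i686", "*-musllinux_i686", "*-win32"]
TEST_SKIP = ["*-macosx_arm64"]


def matches_any(identifier, patterns):
    """Return True if the identifier matches at least one glob pattern."""
    return any(fnmatchcase(identifier, pattern) for pattern in patterns)


def plan(identifiers, build_pattern):
    """Hypothetical helper: map each selected identifier to whether it is tested."""
    selected = {}
    for ident in identifiers:
        if not fnmatchcase(ident, build_pattern):
            continue  # not selected by the build pattern
        if matches_any(ident, SKIP):
            continue  # explicitly skipped (32-bit targets here)
        # Built, but only tested when not matched by test-skip.
        selected[ident] = not matches_any(ident, TEST_SKIP)
    return selected


# Illustrative sample only, not the full set of build identifiers.
SAMPLE = [
    "cp39-manylinux_x86_64",
    "cp39-manylinux_i686",
    "cp39-win32",
    "cp39-win_amd64",
    "cp39-macosx_arm64",
    "cp310-manylinux_x86_64",
]

if __name__ == "__main__":
    for ident, tested in plan(SAMPLE, "cp39-*").items():
        print(ident, "tested" if tested else "built only")
```

In this sketch the arm64 macOS wheel is still built but not tested, which matches the comment in the config about cross-compiling on Intel.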

setup.py

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 public = True
 
 if public:
-    version = "0.1"
+    version = "0.1.1"
 
 setup(
     name="tiktoken",

tests/test_simple_public.py

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+import tiktoken
+
+
+def test_simple():
+    enc = tiktoken.get_encoding("gpt2")
+    assert enc.encode("hello world") == [31373, 995]
+    assert enc.decode([31373, 995]) == "hello world"
+
+    enc = tiktoken.get_encoding("cl100k_base")
+    assert enc.encode("hello world") == [15339, 1917]
+    assert enc.decode([15339, 1917]) == "hello world"

tiktoken/core.py

Lines changed: 2 additions & 0 deletions
@@ -153,6 +153,8 @@ def encode_with_unstable(
 
         See `encode` for more details on `allowed_special` and `disallowed_special`.
 
+        This API should itself be considered unstable.
+
         ```
         >>> enc.encode_with_unstable("hello fanta")
         ([31373], [(277, 4910), (5113, 265), ..., (8842,)])

tiktoken_ext/openai_public.py

Lines changed: 28 additions & 1 deletion
@@ -21,6 +21,28 @@ def gpt2():
     }
 
 
+def r50k_base():
+    mergeable_ranks = load_tiktoken_bpe("az://openaipublic/encodings/r50k_base.tiktoken")
+    return {
+        "name": "r50k_base",
+        "explicit_n_vocab": 50257,
+        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
+        "mergeable_ranks": mergeable_ranks,
+        "special_tokens": {ENDOFTEXT: 50256},
+    }
+
+
+def p50k_base():
+    mergeable_ranks = load_tiktoken_bpe("az://openaipublic/encodings/p50k_base.tiktoken")
+    return {
+        "name": "p50k_base",
+        "explicit_n_vocab": 50281,
+        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
+        "mergeable_ranks": mergeable_ranks,
+        "special_tokens": {ENDOFTEXT: 50256},
+    }
+
+
 def cl100k_base():
     mergeable_ranks = load_tiktoken_bpe("az://openaipublic/encodings/cl100k_base.tiktoken")
     special_tokens = {
@@ -38,4 +60,9 @@ def cl100k_base():
     }
 
 
-ENCODING_CONSTRUCTORS = {"gpt2": gpt2, "cl100k_base": cl100k_base}
+ENCODING_CONSTRUCTORS = {
+    "gpt2": gpt2,
+    "r50k_base": r50k_base,
+    "p50k_base": p50k_base,
+    "cl100k_base": cl100k_base,
+}
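
The `ENCODING_CONSTRUCTORS` mapping is a name-to-constructor registry: each value is a function that is only called when its encoding is requested. The self-contained sketch below illustrates that pattern with stub constructors that do not load any BPE files. The `get_encoding_params` helper is hypothetical and is not tiktoken's actual lookup code, and the `ENDOFTEXT` string is an assumption, since its definition is not shown in this diff.

```python
# Assumed value of the special token constant; not shown in the diff.
ENDOFTEXT = "<|endoftext|>"


def r50k_base():
    # Stub: the real constructor also loads mergeable ranks and sets pat_str.
    return {
        "name": "r50k_base",
        "explicit_n_vocab": 50257,
        "special_tokens": {ENDOFTEXT: 50256},
    }


def p50k_base():
    # Stub with the vocabulary size taken from the diff.
    return {
        "name": "p50k_base",
        "explicit_n_vocab": 50281,
        "special_tokens": {ENDOFTEXT: 50256},
    }


# Registry of constructors; nothing is built until a name is looked up.
ENCODING_CONSTRUCTORS = {
    "r50k_base": r50k_base,
    "p50k_base": p50k_base,
}


def get_encoding_params(name):
    """Hypothetical helper: look up a constructor by name and call it."""
    try:
        constructor = ENCODING_CONSTRUCTORS[name]
    except KeyError:
        raise ValueError(f"Unknown encoding {name!r}") from None
    return constructor()


if __name__ == "__main__":
    params = get_encoding_params("p50k_base")
    print(params["name"], params["explicit_n_vocab"])
```

Storing functions rather than prebuilt dictionaries keeps registration cheap, since the expensive loading step only happens for the encoding that is actually used.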
