Add support for checking hash of downloaded files before use. (#230)
We use tiktoken in various production scenarios and sometimes find that
the download of a `.tiktoken` file (e.g., `cl100k_base.tiktoken`) is
interrupted or fails, leaving a corrupted file in the cache. In those
cases, the results returned from the encoder will be incorrect and could
be damaging to our production instances.
More often, when this happens, `Encoder.encode()` will throw an
exception such as
```
pyo3_runtime.PanicException: no entry found for key
```
which turns out to be quite hard to track down.
In an effort to make tiktoken more robust for production use, this PR
adds the `sha256` hash of each of the downloaded files to
`openai_public.py` and augments `read_file` to verify the hash, if one
is provided, whenever the file is read from the cache or downloaded
directly. Errors are therefore flagged at file load time rather than
when the files are used, with a more meaningful error message
indicating what might have gone wrong.
This also protects users of tiktoken from scenarios where a network
issue or MITM attack could have corrupted these files in transit.
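The check described above can be illustrated with a minimal sketch. This is not the code from this PR: `check_hash`, `read_file_cached`, the `fetch` callable, and the choice to discard a corrupted cache entry and re-download it are all assumptions made for illustration. Only `hashlib` and `os` from the standard library are used.

```python
import hashlib
import os
from typing import Callable, Optional


def check_hash(data: bytes, expected_hash: str) -> bool:
    """Return True if the SHA-256 hex digest of `data` equals `expected_hash`."""
    return hashlib.sha256(data).hexdigest() == expected_hash


def read_file_cached(
    name: str,
    fetch: Callable[[], bytes],
    cache_dir: str,
    expected_hash: Optional[str] = None,
) -> bytes:
    """Hypothetical helper: load `name` from `cache_dir`, or call `fetch()`.

    If `expected_hash` is given, the bytes are verified before being returned,
    so a truncated or tampered file fails at load time.
    """
    cache_path = os.path.join(cache_dir, name)

    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            data = f.read()
        if expected_hash is None or check_hash(data, expected_hash):
            return data
        # Assumption: a cached file that fails the check is discarded
        # and fetched again instead of being used.
        os.remove(cache_path)

    data = fetch()
    if expected_hash is not None and not check_hash(data, expected_hash):
        raise ValueError(
            f"Hash mismatch for {name!r} (expected {expected_hash}). "
            "The download may have been interrupted or altered in transit."
        )

    # Write to a temporary file first, then rename, so an interrupted
    # write cannot leave a partial file at the final cache path.
    os.makedirs(cache_dir, exist_ok=True)
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, cache_path)
    return data
```

In this sketch, a caller would pass the known hash alongside the file name, and a mismatch surfaces as a `ValueError` naming the file instead of a panic deep inside the encoder.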