Add possessive quantifiers to avoid catastrophic backtracking #258
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,6 +11,23 @@ | |
from .test_helpers import ENCODING_FACTORIES, MAX_EXAMPLES | ||
|
||
|
||
@pytest.mark.skip(reason="Takes a really long time to finish, but was added to reproduce a crash.") | ||
@pytest.mark.parametrize("make_enc", ENCODING_FACTORIES) | ||
def test_extremely_big_encoding(make_enc: Callable[[], tiktoken.Encoding]): | ||
    enc = make_enc() | ||
    for c in ["^", "0", "a", "'s"]: # TODO " ", "\n" are still failing | ||
        print(f"Validating `{c}`") | ||
|
||
        big_value = c * 1_000_000 | ||
        assert big_value == enc.decode(enc.encode(big_value)) | ||
|
||
        big_value = " " + big_value
Review comment: the space is often optional at the beginning, so this way the backtracking can reach the space. Let's test that as well. |
||
        assert big_value == enc.decode(enc.encode(big_value)) | ||
|
||
        big_value = big_value + "\n"
Review comment: some groups require a newline at the end, so stress those paths as well. |
||
        assert big_value == enc.decode(enc.encode(big_value)) | ||
|
||
|
||
|
||
def test_simple(): | ||
    enc = tiktoken.get_encoding("gpt2") | ||
    assert enc.encode("hello world") == [31373, 995] | ||
|
Review comment: not strictly necessary, but it adds a tiny speed increase.
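For readers following the review comments on `test_extremely_big_encoding`, here is a minimal sketch of how the three stress inputs are built: the repeated unit, then the same string with a leading space, then with a trailing newline as well. The helper names `stress_variants`, `roundtrips`, `utf8_encode`, and `utf8_decode` are hypothetical and not part of tiktoken; the UTF-8 byte codec is only a stand-in so the sketch runs without the library installed.

```python
from typing import Callable, Iterator, List


def stress_variants(unit: str, repeats: int) -> Iterator[str]:
    """Yield the three inputs the test checks, in the same order."""
    value = unit * repeats
    yield value              # plain repetition
    value = " " + value
    yield value              # leading space, often optional in a pattern
    value = value + "\n"
    yield value              # trailing newline, required by some groups


def roundtrips(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    text: str,
) -> bool:
    """Return True if decoding the encoded text gives back the input."""
    return decode(encode(text)) == text


# Stand-in codec (an assumption for illustration): UTF-8 bytes as token ids.
def utf8_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def utf8_decode(tokens: List[int]) -> str:
    return bytes(tokens).decode("utf-8")


if __name__ == "__main__":
    for unit in ["^", "0", "a", "'s"]:
        # The test in the diff uses 1_000_000 repeats; 1_000 keeps this quick.
        for text in stress_variants(unit, 1_000):
            assert roundtrips(utf8_encode, utf8_decode, text)
```

With tiktoken installed, passing `enc.encode` and `enc.decode` from an `Encoding` object in place of the stand-in codec should exercise the same round trip as the test in the diff.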