Why handle "\n" specially to replace the "<|endoftext|>" in BPETokenizerSimple #813
Unanswered
myme5261314 asked this question in Q&A
Replies: 1 comment
-
Maybe related to the following code:

```python
# Check if any disallowed special tokens are in the remainder
disallowed = [
    tok for tok in self.inverse_vocab
    if tok.startswith("<|") and tok.endswith("|>")
    and tok in text and tok not in allowed_special
]
if disallowed:
    raise ValueError(f"Disallowed special tokens encountered in text: {disallowed}")

# If no special tokens, or remaining text after special token split:
tokens = []
lines = text.split("\n")
for i, line in enumerate(lines):
    if i > 0:
        tokens.append("\n")  # newline re-inserted as its own token
    words = line.split()
    for j, word in enumerate(words):
        if j == 0 and i > 0:
            tokens.append("Ġ" + word)
        elif j == 0:
            tokens.append(word)
        else:
            tokens.append("Ġ" + word)

for token in tokens:
    if token in self.inverse_vocab:
        token_ids.append(self.inverse_vocab[token])
    else:
        token_ids.extend(self.tokenize_with_bpe(token))

return token_ids
```
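The newline handling in that loop can be exercised on its own. Below is a minimal standalone sketch; `pretokenize` is a hypothetical helper name, and the special-token and BPE steps are ignored:

```python
def pretokenize(text):
    # Split on newlines, re-inserting "\n" as its own token and marking
    # every word that follows whitespace with the GPT-2-style "Ġ"
    # (U+0120) prefix, mirroring the loop in the encode code above.
    tokens = []
    for i, line in enumerate(text.split("\n")):
        if i > 0:
            tokens.append("\n")
        for j, word in enumerate(line.split()):
            if j == 0 and i == 0:
                tokens.append(word)      # very first word: no prefix
            else:
                tokens.append("Ġ" + word)
    return tokens

print(pretokenize("Hello world\nnew line"))
# ['Hello', 'Ġworld', '\n', 'Ġnew', 'Ġline']
```

Note that "\n" is emitted as a literal token, which is why the tokenizer needs some vocabulary entry for it, even though GPT-2's encoder.json has no plain "\n" key.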
-
In bpe-from-scratch.ipynb, the code snippet in the 23rd cell ends with a line that replaces the vocab's "<|endoftext|>" entry with "\n". I'm not sure why the code needs to do this substitution.
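The substitution being asked about can be shown as a toy sketch. The variable names `vocab` and `inverse_vocab` follow the BPETokenizerSimple attributes; the tiny two-entry dictionaries are hypothetical stand-ins for the full GPT-2 vocabulary:

```python
vocab = {50256: "<|endoftext|>"}          # token_id -> token (toy excerpt)
inverse_vocab = {"<|endoftext|>": 50256}  # token -> token_id

# OpenAI's encoder.json has no plain "\n" key, so the simplified
# tokenizer reuses an existing token ID as a stand-in for "\n".
if "\n" not in inverse_vocab:
    fallback_id = inverse_vocab["<|endoftext|>"]
    inverse_vocab["\n"] = fallback_id
    vocab[fallback_id] = "\n"  # the "<|endoftext|>" entry now reads back as "\n"

print(vocab[50256])  # '\n'
```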
I found no special handling of "\n" in bpe_openai_gpt2.py. I've confirmed through the tiktoken interactive app that the token ID of "\n" is 198 and that "<|endoftext|>" keeps its original token ID 50256 from encoder.json. But the entry for token ID 198 in encoder.json is not the "\n" character; it is "\u010a": 198, which renders as "Ċ". Does anyone know the details?
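The "\u010a" entry is explained by GPT-2's byte-level encoding: encoder.json stores tokens after every byte has been remapped to a printable Unicode character, and byte 0x0A ("\n") maps to "Ċ" (U+010A). A minimal sketch of that mapping, adapted from the `bytes_to_unicode` helper in OpenAI's encoder.py (the same file vendored as bpe_openai_gpt2.py):

```python
def bytes_to_unicode():
    # Printable bytes keep their own codepoint; the remaining bytes
    # (controls like "\n", etc.) are shifted up by 256 so that every
    # byte gets a visible, distinct character in encoder.json.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()
print(mapping[ord("\n")])  # 'Ċ' (U+010A): the key stored with ID 198
```

So token 198 really is "\n"; it just appears in encoder.json under its byte-remapped form "Ċ". BPETokenizerSimple skips this byte-level remapping, which is why it needs the "<|endoftext|>"-to-"\n" fallback instead.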