Why handle "\n" specially to replace the "<|endoftext|>" in BPETokenizerSimple #813
Unanswered
myme5261314 asked this question in Q&A
Replies: 1 comment
-
Maybe related to the following code:

```python
# Check if any disallowed special tokens are in the remainder
disallowed = [
    tok for tok in self.inverse_vocab
    if tok.startswith("<|") and tok.endswith("|>")
    and tok in text and tok not in allowed_special
]
if disallowed:
    raise ValueError(f"Disallowed special tokens encountered in text: {disallowed}")

# If no special tokens, or remaining text after special token split:
tokens = []
lines = text.split("\n")
for i, line in enumerate(lines):
    if i > 0:
        tokens.append("\n")  # newline re-inserted as its own token
    words = line.split()
    for j, word in enumerate(words):
        if j == 0 and i > 0:
            tokens.append("Ġ" + word)
        elif j == 0:
            tokens.append(word)
        else:
            tokens.append("Ġ" + word)

for token in tokens:
    if token in self.inverse_vocab:
        token_ids.append(self.inverse_vocab[token])
    else:
        token_ids.extend(self.tokenize_with_bpe(token))

return token_ids
```
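The newline handling in that loop can be exercised on its own. Below is a minimal standalone sketch; `pretokenize` is a hypothetical helper name, and the special-token and BPE steps are ignored:

```python
def pretokenize(text):
    # Split on newlines, re-inserting "\n" as its own token and marking
    # every word that follows whitespace with the GPT-2-style "Ġ"
    # (U+0120) prefix, mirroring the loop in the encode code above.
    tokens = []
    for i, line in enumerate(text.split("\n")):
        if i > 0:
            tokens.append("\n")
        for j, word in enumerate(line.split()):
            if j == 0 and i == 0:
                tokens.append(word)      # very first word: no prefix
            else:
                tokens.append("Ġ" + word)
    return tokens

print(pretokenize("Hello world\nnew line"))
# ['Hello', 'Ġworld', '\n', 'Ġnew', 'Ġline']
```

Note that "\n" is emitted as a literal token, which is why the tokenizer needs some vocabulary entry for it, even though GPT-2's encoder.json has no plain "\n" key.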
-
In bpe-from-scratch.ipynb, the code snippet in the 23rd cell ends with a line that replaces the vocab's "<|endoftext|>" entry with "\n". I'm not sure why the code needs to do this substitution.
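The substitution being asked about can be shown as a toy sketch. The variable names `vocab` and `inverse_vocab` follow the BPETokenizerSimple attributes; the tiny two-entry dictionaries are hypothetical stand-ins for the full GPT-2 vocabulary:

```python
vocab = {50256: "<|endoftext|>"}          # token_id -> token (toy excerpt)
inverse_vocab = {"<|endoftext|>": 50256}  # token -> token_id

# OpenAI's encoder.json has no plain "\n" key, so the simplified
# tokenizer reuses an existing token ID as a stand-in for "\n".
if "\n" not in inverse_vocab:
    fallback_id = inverse_vocab["<|endoftext|>"]
    inverse_vocab["\n"] = fallback_id
    vocab[fallback_id] = "\n"  # the "<|endoftext|>" entry now reads back as "\n"

print(vocab[50256])  # '\n'
```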
I found no special handling of "\n" in bpe_openai_gpt2.py. I've confirmed through the tiktoken interactive app that the token ID of "\n" is 198 and that "<|endoftext|>" keeps its original token ID 50256 from encoder.json. But the entry for token ID 198 in encoder.json is not the "\n" character; it is "\u010a": 198, which renders as "Ċ". Does anyone know the details?
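The "\u010a" entry is explained by GPT-2's byte-level encoding: encoder.json stores tokens after every byte has been remapped to a printable Unicode character, and byte 0x0A ("\n") maps to "Ċ" (U+010A). A minimal sketch of that mapping, adapted from the `bytes_to_unicode` helper in OpenAI's encoder.py (the same file vendored as bpe_openai_gpt2.py):

```python
def bytes_to_unicode():
    # Printable bytes keep their own codepoint; the remaining bytes
    # (controls like "\n", etc.) are shifted up by 256 so that every
    # byte gets a visible, distinct character in encoder.json.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()
print(mapping[ord("\n")])  # 'Ċ' (U+010A): the key stored with ID 198
```

So token 198 really is "\n"; it just appears in encoder.json under its byte-remapped form "Ċ". BPETokenizerSimple skips this byte-level remapping, which is why it needs the "<|endoftext|>"-to-"\n" fallback instead.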