
tokenizer.encode's parameter add_special_tokens=False does not work #765

@xiaohan2909


🐛 Describe the bug

The tokenizer comes from the olmo.tokenizer package. With the token id left at its default value (50279), load the default tokenizer and run:

input:
tokenizer.encode("hello", add_special_tokens=False)
output:
[25521, 50279]

The result shows that the add_special_tokens=False parameter has no effect. The cause is at olmo/tokenizer.py line 183:

batch_encoding = self.base_tokenizer.encode_batch(inputs)

The add_special_tokens argument is not forwarded to the base tokenizer's encode call, so the special token (50279) is always appended.
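A minimal, self-contained sketch of the suspected bug and the obvious fix. The classes below are stand-ins, not OLMo's actual code: FakeBaseTokenizer mimics a base tokenizer whose encode_batch accepts an add_special_tokens flag (as the Hugging Face tokenizers library does), and the two wrappers show the flag being swallowed versus forwarded.

```python
class FakeEncoding:
    """Stand-in for an encoding object exposing token ids."""
    def __init__(self, ids):
        self.ids = ids


class FakeBaseTokenizer:
    """Stand-in base tokenizer: appends the special token (50279)
    only when add_special_tokens is True."""
    SPECIAL = 50279

    def encode_batch(self, inputs, add_special_tokens=True):
        out = []
        for _text in inputs:
            ids = [25521]  # pretend every input encodes to [25521]
            if add_special_tokens:
                ids = ids + [self.SPECIAL]
            out.append(FakeEncoding(ids))
        return out


class BuggyTokenizer:
    """Mirrors the reported bug: the flag is accepted but never forwarded."""
    def __init__(self):
        self.base_tokenizer = FakeBaseTokenizer()

    def encode(self, text, add_special_tokens=True):
        # Bug: add_special_tokens is dropped, so encode_batch uses
        # its default of True and the special token is always added.
        batch_encoding = self.base_tokenizer.encode_batch([text])
        return batch_encoding[0].ids


class FixedTokenizer(BuggyTokenizer):
    """Same wrapper, but the flag is passed through."""
    def encode(self, text, add_special_tokens=True):
        batch_encoding = self.base_tokenizer.encode_batch(
            [text], add_special_tokens=add_special_tokens
        )
        return batch_encoding[0].ids


print(BuggyTokenizer().encode("hello", add_special_tokens=False))  # [25521, 50279]
print(FixedTokenizer().encode("hello", add_special_tokens=False))  # [25521]
```

The fix amounts to threading add_special_tokens from the wrapper's encode through to the base tokenizer's encode_batch call.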

I found the bug because it triggered an assertion at scripts/prepare_tulu_data.py line 90.

Versions

0.5.1
