Hi, here is my situation:
- I pretrained a language model on English-only corpus, using BPE tokenization with vocab_size=32000.
- I want to continue training the model on Japanese corpus.
Since that tokenizer is unable to handle Japanese text, I'm wondering if it's possible to extend the original English BPE tokenizer so that it can also tokenize Japanese. Here is my idea:
- Train another BPE model on a Japanese corpus, also with vocab_size=32000 (a minimal training sketch follows this list).
- Then merge the two BPE models into a new model while keeping the tokenization of English unchanged, so that English sentences are tokenized exactly as before.
- The resulting vocab_size should be roughly 64000 (slightly fewer if there are duplicate tokens between the English and Japanese vocabularies).
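For the training step, here is a minimal sketch of what I have in mind, assuming the Hugging Face `tokenizers` library (the corpus and output paths are placeholders, and I'm assuming the English tokenizer was trained with the same byte-level pre-tokenization so the two models are compatible):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a separate BPE tokenizer on the Japanese corpus.
# "japanese_corpus.txt" is a placeholder path.
tokenizer = Tokenizer(models.BPE())

# Byte-level pre-tokenization works without whitespace,
# which Japanese text does not have.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train(files=["japanese_corpus.txt"], trainer=trainer)
tokenizer.save("japanese_bpe.json")
```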
I'm not sure whether it's actually possible to merge the two BPE models into a new model while keeping the English tokenization unchanged. Any help would be appreciated!
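For concreteness, here is a rough sketch of the merge I'm imagining, operating directly on the JSON files saved by `tokenizers` (file names are placeholders). Japanese tokens are appended after the existing English ids, and Japanese merge rules after the English ones, so no English id or merge priority changes:

```python
import json

# Placeholder file names; both files were saved by Tokenizer.save().
with open("english_bpe.json", encoding="utf-8") as f:
    en = json.load(f)
with open("japanese_bpe.json", encoding="utf-8") as f:
    ja = json.load(f)

# 1. Extend the English vocab with unseen Japanese tokens,
#    assigning fresh ids so every original English id is unchanged.
vocab = en["model"]["vocab"]
next_id = max(vocab.values()) + 1
for token in ja["model"]["vocab"]:
    if token not in vocab:
        vocab[token] = next_id
        next_id += 1

# 2. Append Japanese merge rules after the English ones. BPE applies
#    merges by rank, so all English pairs keep their original priority.
#    Depending on the library version, a merge is stored as an "a b"
#    string or an [a, b] list; handle both when checking for duplicates.
def as_key(merge):
    return tuple(merge) if isinstance(merge, list) else merge

seen = {as_key(m) for m in en["model"]["merges"]}
for m in ja["model"]["merges"]:
    if as_key(m) not in seen:
        en["model"]["merges"].append(m)
        seen.add(as_key(m))

with open("merged_bpe.json", "w", encoding="utf-8") as f:
    json.dump(en, f, ensure_ascii=False)
```

One thing I'm still unsure about: even with the English merges ranked first, an appended Japanese merge could in principle fire on English text once all English merges are exhausted (e.g., if the Japanese corpus contained Latin-script pairs the English model never merged), so I would verify on held-out English text that the tokenization really is byte-for-byte identical.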