Hi, here is my situation:
- I pretrained a language model on English-only corpus, using BPE tokenization with vocab_size=32000.
- I want to continue training the model on Japanese corpus.
Since that tokenizer is unable to handle Japanese text, I'm wondering if it's possible to extend the original English BPE tokenizer so that it can also tokenize Japanese. Here is my idea:
- Train another BPE model on a Japanese corpus, also with vocab_size=32000 (a minimal training sketch follows this list).
- Then merge the two BPE models into a new model while keeping the tokenization of English unchanged, so that English sentences are tokenized exactly as before.
- The resulting vocab_size should be roughly 64000 (slightly fewer if there are duplicate tokens between the English and Japanese vocabularies).
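For the training step, here is a minimal sketch of what I have in mind, assuming the Hugging Face `tokenizers` library (the corpus and output paths are placeholders, and I'm assuming the English tokenizer was trained with the same byte-level pre-tokenization so the two models are compatible):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a separate BPE tokenizer on the Japanese corpus.
# "japanese_corpus.txt" is a placeholder path.
tokenizer = Tokenizer(models.BPE())

# Byte-level pre-tokenization works without whitespace,
# which Japanese text does not have.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train(files=["japanese_corpus.txt"], trainer=trainer)
tokenizer.save("japanese_bpe.json")
```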
I'm not sure whether it's actually possible to merge the two BPE models into a new model while keeping the English tokenization unchanged. Any help would be appreciated!
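For concreteness, here is a rough sketch of the merge I'm imagining, operating directly on the JSON files saved by `tokenizers` (file names are placeholders). Japanese tokens are appended after the existing English ids, and Japanese merge rules after the English ones, so no English id or merge priority changes:

```python
import json

# Placeholder file names; both files were saved by Tokenizer.save().
with open("english_bpe.json", encoding="utf-8") as f:
    en = json.load(f)
with open("japanese_bpe.json", encoding="utf-8") as f:
    ja = json.load(f)

# 1. Extend the English vocab with unseen Japanese tokens,
#    assigning fresh ids so every original English id is unchanged.
vocab = en["model"]["vocab"]
next_id = max(vocab.values()) + 1
for token in ja["model"]["vocab"]:
    if token not in vocab:
        vocab[token] = next_id
        next_id += 1

# 2. Append Japanese merge rules after the English ones. BPE applies
#    merges by rank, so all English pairs keep their original priority.
#    Depending on the library version, a merge is stored as an "a b"
#    string or an [a, b] list; handle both when checking for duplicates.
def as_key(merge):
    return tuple(merge) if isinstance(merge, list) else merge

seen = {as_key(m) for m in en["model"]["merges"]}
for m in ja["model"]["merges"]:
    if as_key(m) not in seen:
        en["model"]["merges"].append(m)
        seen.add(as_key(m))

with open("merged_bpe.json", "w", encoding="utf-8") as f:
    json.dump(en, f, ensure_ascii=False)
```

One thing I'm still unsure about: even with the English merges ranked first, an appended Japanese merge could in principle fire on English text once all English merges are exhausted (e.g., if the Japanese corpus contained Latin-script pairs the English model never merged), so I would verify on held-out English text that the tokenization really is byte-for-byte identical.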