Is it possible to extend a trained BPE model's merge operations? #118

Description

@pluiez

Hi, here is my situation.

  1. I pretrained a language model on an English-only corpus, using BPE tokenization with vocab_size=32000.
  2. I now want to continue training the model on a Japanese corpus.

Since the tokenizer is unable to handle Japanese text, I'm wondering if it's possible to extend the original BPE tokenizer, trained on the English corpus, so that it can also tokenize Japanese. Here is my idea:

  1. Train another BPE model on the Japanese corpus with vocab_size=32000.
  2. Merge the two BPE models into a new model while leaving the English merges untouched, so that English sentences are tokenized exactly as before.
  3. The resulting vocab_size should be slightly under 64000, since there may be some duplicates between the English and Japanese vocabularies.

I'm not sure whether it's actually possible to merge the two BPE models into a new model while keeping the English tokenization unchanged. Any help would be appreciated!
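For concreteness, here is a minimal sketch of what I mean by step 2, assuming the merges are stored in subword-nmt-style codes files (one merge rule per line, highest priority first). The file paths and the `combine_codes` helper are made up for illustration:

```python
# A minimal sketch: concatenate two BPE merge lists, keeping all English
# merges first so their priorities (ranks) are unchanged.

def read_merges(path):
    """Return (header, merges) from a subword-nmt-style BPE codes file."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    # subword-nmt codes files start with a "#version: ..." header line
    header = lines[0] if lines and lines[0].startswith("#") else None
    merges = [line for line in lines if line.strip() and not line.startswith("#")]
    return header, merges

def combine_codes(en_path, ja_path, out_path):
    """Append new Japanese merges after the English merges at lower priority."""
    header, en_merges = read_merges(en_path)
    _, ja_merges = read_merges(ja_path)

    seen = set(en_merges)
    combined = list(en_merges)      # English merges keep their original ranks
    for merge in ja_merges:
        if merge not in seen:       # drop merges both models learned independently
            combined.append(merge)  # Japanese merges get strictly lower priority
            seen.add(merge)

    with open(out_path, "w", encoding="utf-8") as f:
        if header:
            f.write(header + "\n")
        f.write("\n".join(combined) + "\n")

combine_codes("codes.en", "codes.ja", "codes.en-ja")
```

One thing I'm unsure about: even with the Japanese merges appended at strictly lower priority, an appended merge over Latin-script symbols (e.g., learned from romaji in the Japanese corpus) could still fire on English text after all the English merges have applied, so the English tokenization might not be fully preserved without additional filtering.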
