Labels: bug, enhancement
Description
TensorFlow supports two (or three) different types of WordPiece tokenizers.
It could be worth testing the FastWordPiece tokenizer, since it can build its model directly from a vocab and is claimed to be faster, as mentioned in:
- does the wordpiece tool the same with the bert one? tensorflow/text#116
- Generate BERT wordpiece vocab? tensorflow/text#414
But it will likely also require a bit more setup (https://www.tensorflow.org/text/guide/subwords_tokenizer#overview), since WordPiece only seems to split individual words, whereas the BertTokenizer splits whole sentences; see the sketch below.
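A minimal sketch of how the comparison could look, assuming tensorflow_text >= 2.7 (where FastWordpieceTokenizer is available); the tiny vocab list and example sentence are placeholders for illustration:

```python
import tensorflow as tf
import tensorflow_text as text

# Placeholder vocabulary for illustration; a real BERT vocab file would be used.
vocab = ["[UNK]", "[CLS]", "[SEP]", "hello", "world", "##s"]

# BertTokenizer expects a lookup table (or a vocab file path) and works on whole
# sentences: it splits on whitespace/punctuation and then applies WordPiece.
lookup = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=vocab, values=tf.range(len(vocab), dtype=tf.int64)),
    num_oov_buckets=1)
bert_tok = text.BertTokenizer(lookup, lower_case=True)

# FastWordpieceTokenizer builds its model from the vocab list directly and, with
# the default no_pretokenization=False, also handles raw sentences end to end.
fast_tok = text.FastWordpieceTokenizer(vocab=vocab, token_out_type=tf.int64)

sentences = tf.constant(["hello worlds"])

# BertTokenizer returns a 3D ragged tensor (sentence -> word -> subword ids),
# so merge the last two dims before comparing with FastWordpieceTokenizer.
bert_ids = bert_tok.tokenize(sentences).merge_dims(-2, -1)
fast_ids = fast_tok.tokenize(sentences)
print(bert_ids, fast_ids)
```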
Goal
- Compare the different tokenizers and see if they yield the same results
- Check whether the new tokenizer can be saved as a Reusable SavedModel (a sketch follows after this list)
- Test whether the models that previously failed now work (Tokenizers do not convert tokens correctly #4)
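A minimal sketch of the SavedModel check, assuming the tokenizer is wrapped in a tf.Module; the module name, vocab, and export path are placeholders:

```python
import tensorflow as tf
import tensorflow_text as text  # must also be imported before loading, so the custom ops are registered

class TokenizerModule(tf.Module):
    """Hypothetical wrapper exposing tokenization as a reusable SavedModel."""

    def __init__(self, vocab):
        super().__init__()
        self.tokenizer = text.FastWordpieceTokenizer(
            vocab=vocab, token_out_type=tf.int64)

    @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.string)])
    def tokenize(self, sentences):
        # Ragged tensor of subword ids, one row per input sentence.
        return self.tokenizer.tokenize(sentences)

vocab = ["[UNK]", "[CLS]", "[SEP]", "hello", "world", "##s"]  # placeholder
module = TokenizerModule(vocab)
tf.saved_model.save(module, "fast_wordpiece_tokenizer")

# Reload and verify the exported function still tokenizes raw text.
reloaded = tf.saved_model.load("fast_wordpiece_tokenizer")
print(reloaded.tokenize(tf.constant(["hello worlds"])))
```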