[utoken](https://github.com/uhermjakob/utoken) is a general-purpose word tokenizer, in the spirit of sacremoses. Machine can provide a utoken-backed implementation of its tokenizer interface.
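
A minimal sketch of what such a wrapper might look like, written as an adapter so it runs without utoken installed. Everything named here is an assumption for illustration: the `Tokenizer` protocol is a hypothetical stand-in for Machine's real tokenizer interface, and `naive_punctuation_splitter` is a placeholder backend, not utoken. The adapter only assumes the backend is a function that takes a string and returns the tokens joined by spaces.

```python
from typing import Callable, List, Protocol


class Tokenizer(Protocol):
    """Hypothetical stand-in for Machine's tokenizer interface."""

    def tokenize(self, text: str) -> List[str]: ...


class SpaceJoinedTokenizerAdapter:
    """Adapts a function returning space-separated tokens to the interface above."""

    def __init__(self, tokenize_to_string: Callable[[str], str]) -> None:
        self._tokenize_to_string = tokenize_to_string

    def tokenize(self, text: str) -> List[str]:
        # split() with no arguments drops the empty strings that
        # repeated or trailing whitespace would otherwise produce.
        return self._tokenize_to_string(text).split()


def naive_punctuation_splitter(text: str) -> str:
    """Placeholder backend (NOT utoken): pads common punctuation with spaces."""
    for mark in ".,!?":
        text = text.replace(mark, f" {mark} ")
    return text


# The placeholder backend is swapped in here so the sketch is runnable as-is.
tokenizer: Tokenizer = SpaceJoinedTokenizerAdapter(naive_punctuation_splitter)
print(tokenizer.tokenize("Hello, world!"))
```

utoken's README shows a `utokenize.Tokenizer(lang_code=...)` object with a `utokenize_string` method that returns a space-separated string. If that matches the installed version, that method could be passed to the adapter in place of the placeholder; this has not been checked against Machine's actual interface.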