Maybe the input should just be `np.array`s of strings, with a default tokenizer applied when the caller doesn't supply one?
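A rough sketch of what that could look like, just to make the idea concrete. The names `tokenize_inputs` and `whitespace_tokenizer` are made up for illustration and are not part of any existing API, and whitespace splitting is only an assumed default:

```python
import numpy as np


def whitespace_tokenizer(text):
    """Hypothetical default tokenizer: split on runs of whitespace."""
    return text.split()


def tokenize_inputs(texts, tokenizer=None):
    """Tokenize a 1-D array of strings.

    Falls back to the (assumed) whitespace default when no tokenizer is given.
    """
    texts = np.asarray(texts, dtype=str)
    if texts.ndim != 1:
        raise ValueError("expected a 1-D array of strings")
    if tokenizer is None:
        tokenizer = whitespace_tokenizer
    # Convert each NumPy string scalar to a plain Python str before tokenizing.
    return [tokenizer(str(t)) for t in texts]


docs = np.array(["the quick brown fox", "hello world"])

# Default behaviour: whitespace tokens.
tokens = tokenize_inputs(docs)

# Any callable str -> list can be swapped in, e.g. character-level tokens.
chars = tokenize_inputs(docs, tokenizer=list)
```

Keeping the tokenizer as an optional callable means callers who already have their own tokenization can pass it in, while everyone else gets something usable from raw strings.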