
Inputs should be tokenized only for training/evaluation sets? #54

@stelmath

Description

Hello,

Your README states:

Inputs should be tokenized and each line is a source language sentence and its target language translation, separated by (|||). You can see some examples in the examples folder.

Is this the case only for the training set and the optional evaluation set? During inference/prediction, do we also need to pass the source/target pair, pretokenized? Your demo uses the example:

src = 'awesome-align is awesome !'
tgt = '牛对齐 是 牛 !'

where the ! is pretokenized, as there is a space between it and the previous word ("awesome" in this case). Also, does this requirement stem from the original mBERT, or is it specific to your implementation? Thank you!
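
To make sure I understand, here is a rough sketch of what I assume the pre-processing for an inference pair would look like before writing the ||| line. The tokenizer below is only a placeholder for illustration, not something I take to be your recommended tool:

```python
import re

def pretokenize(text: str) -> list[str]:
    # Placeholder tokenizer for illustration: put spaces around punctuation so
    # that "!" becomes its own token, matching the demo input. In practice one
    # would presumably use a proper tokenizer (e.g. Moses for English, a
    # segmenter such as jieba for Chinese).
    return re.sub(r"([!?.,;:()\"'])", r" \1 ", text).split()

src = "awesome-align is awesome!"
tgt = "牛对齐 是 牛!"   # already segmented here, apart from the final "!"

# One sentence pair per line, source and target separated by " ||| "
line = " ".join(pretokenize(src)) + " ||| " + " ".join(pretokenize(tgt))
print(line)  # -> awesome-align is awesome ! ||| 牛对齐 是 牛 !
```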
