
Inputs should be tokenized only for training/evaluation sets? #54

@stelmath

Description

Hello,

Your README states:

Inputs should be tokenized and each line is a source language sentence and its target language translation, separated by (|||). You can see some examples in the examples folder.

Is this the case only for the training set and the optional evaluation set? During inference/prediction, do we also need to pass the source/target pair, pretokenized? Your demo uses the example:

src = 'awesome-align is awesome !'
tgt = '牛对齐 是 牛 !'

where the ! is pretokenized, as there is a space between it and the previous word ("awesome" in this case). Also, does this requirement stem from the original mBERT, or is it specific to your implementation? Thank you!
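
To make sure I understand, here is a rough sketch of what I assume the pre-processing for an inference pair would look like before writing the ||| line. The tokenizer below is only a placeholder for illustration, not something I take to be your recommended tool:

```python
import re

def pretokenize(text: str) -> list[str]:
    # Placeholder tokenizer for illustration: put spaces around punctuation so
    # that "!" becomes its own token, matching the demo input. In practice one
    # would presumably use a proper tokenizer (e.g. Moses for English, a
    # segmenter such as jieba for Chinese).
    return re.sub(r"([!?.,;:()\"'])", r" \1 ", text).split()

src = "awesome-align is awesome!"
tgt = "牛对齐 是 牛!"   # already segmented here, apart from the final "!"

# One sentence pair per line, source and target separated by " ||| "
line = " ".join(pretokenize(src)) + " ||| " + " ".join(pretokenize(tgt))
print(line)  # -> awesome-align is awesome ! ||| 牛对齐 是 牛 !
```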
