Description
Hi,
Thanks for the great project! It seems that Stanza applies some preprocessing to the text, which leads to misaligned character offsets and failed NER: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer...
Wouldn't a good solution be to inform the user with a warning (e.g. "stanza performs extra preprocessing on the text... input: xxxx, output: yyyy, char indices may be altered") and then simply proceed? I imagine that many users, myself included, do not particularly care whether char offset n is still char offset n after processing.
Alternatively, is there a way to run the Stanza-specific preprocessing before creating a doc with nlp(...)? That would also prevent the misalignments and give the user more control. Or is there some other fix that I'm not aware of?
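To make the warning idea concrete, here is a minimal sketch in plain Python. It is not taken from spaCy, Stanza, or the wrapper: `warn_if_text_changed` is a hypothetical helper, and it assumes the wrapper has both the original input and the text as Stanza sees it after preprocessing. It only reports what changed and then lets processing continue.

```python
import difflib
import warnings


def warn_if_text_changed(original: str, processed: str) -> bool:
    """Warn when the processed text differs from the input, then carry on.

    Hypothetical helper: not part of spaCy, Stanza, or the wrapper.
    Returns True if a difference was found, False otherwise.
    """
    if original == processed:
        return False

    # Collect the character spans that differ between input and output.
    matcher = difflib.SequenceMatcher(None, original, processed, autojunk=False)
    changes = [
        f"{original[i1:i2]!r} -> {processed[j1:j2]!r} at input offset {i1}"
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

    # Report the differences instead of refusing to set the entities.
    warnings.warn(
        "Text was altered during preprocessing; character offsets may have "
        "shifted. Changes: " + "; ".join(changes),
        UserWarning,
    )
    return True
```

Calling `warn_if_text_changed(text, processed_text)` right after preprocessing would tell the user exactly where the offsets diverge, so they can decide for themselves whether the shift matters.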
spacy version: 3.0.6
stanza-spacy version: 1.2