Multi-word token expansion issue, misaligned tokens --> failed NER (German)

Hi,

thanks for the great project! It seems that stanza applies some pre-processing to the text, which results in misaligned tokens and failed NER: ```UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer...```

Wouldn't a good solution be to inform the user with a warning ("stanza performs extra preprocessing on the text... input: xxxx, output: yyyy, char indices may be altered") and then simply proceed? I can imagine that many users, me included, do not particularly care whether char offset n remains char offset n after processing.

Or is there some way to run the "stanza-custom" preprocessing before creating a doc with nlp(...)? This would also prevent the misalignments and give the user more control. Or is there some other fix that I'm not aware of?
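To make the "warn and proceed" idea concrete, here is a minimal, library-independent sketch. It does not use or reproduce the actual stanza or spacy-stanza code. The function names `token_offsets` and `char_span_to_tokens`, the example sentence, and the assumed expansion of "zum" into "zu" + "dem" are all hypothetical and only illustrate why offsets measured on the original text stop matching token boundaries after expansion.

```python
import warnings


def token_offsets(words):
    """Return (start, end) character offsets of each word in ' '.join(words)."""
    offsets = []
    cursor = 0
    for word in words:
        offsets.append((cursor, cursor + len(word)))
        cursor += len(word) + 1  # +1 for the joining space
    return offsets


def char_span_to_tokens(offsets, start_char, end_char):
    """Map a character span to a (first, last_exclusive) token range.

    Warn and return None, instead of failing, when the span does not
    line up with token boundaries.
    """
    starts = {start: i for i, (start, _) in enumerate(offsets)}
    ends = {end: i for i, (_, end) in enumerate(offsets)}
    if start_char in starts and end_char in ends:
        first, last = starts[start_char], ends[end_char]
        if first <= last:
            return first, last + 1
    warnings.warn(
        f"Span ({start_char}, {end_char}) does not match token boundaries; "
        "skipping this entity."
    )
    return None


original = "Er geht zum Bahnhof in Berlin."
# Assumed tokenizer output (hypothetical): "zum" expanded into "zu" + "dem".
words = ["Er", "geht", "zu", "dem", "Bahnhof", "in", "Berlin", "."]
offsets = token_offsets(words)

# "Berlin" measured on the original text: these offsets are now stale.
start = original.index("Berlin")
stale = char_span_to_tokens(offsets, start, start + len("Berlin"))

# The same entity measured on the expanded text lines up with a token.
expanded = " ".join(words)
start = expanded.index("Berlin")
fresh = char_span_to_tokens(offsets, start, start + len("Berlin"))
```

In this sketch `stale` is `None` (a warning is emitted and processing continues), while `fresh` identifies the single token "Berlin". Running the preprocessing first and computing offsets on its output is what keeps the spans valid.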

spacy version: 3.0.6
stanza-spacy version: 1.2
