Multi-word token expansion issue, misaligned tokens --> failed NER (German) #70

@flipz357

Hi,

Thanks for the great project! It seems that Stanza applies some preprocessing to the text, which results in misaligned tokens and failed NER:

UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer...

Wouldn't a good solution be to inform the user with a warning ("Stanza performs extra preprocessing on the text... input: xxxx, output: yyyy, character indices may be altered") and then simply proceed? I imagine that many users, me included, don't particularly care whether character offset n is still character offset n after processing.

Or is there a way to run Stanza's custom preprocessing before creating a doc with nlp(...)? That would also prevent the misalignments and give users more control. Or is there some other fix that I'm not aware of?
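
To make the proposed warning concrete, here is a minimal sketch in plain Python. It does not call Stanza or spaCy: `check_alignment` is a hypothetical helper, and the word lists are hand-written stand-ins for tokenizer output, assuming that multi-word token expansion turns German "zum" into "zu" + "dem". It only checks whether the produced words still spell out the input once whitespace is ignored, warns with the input and output if they don't, and lets the caller proceed.

```python
import warnings


def check_alignment(text, words):
    """Return True if `words` reproduce `text` once whitespace is ignored.

    Hypothetical helper for illustration; not part of Stanza or spaCy.
    """
    original = "".join(text.split())
    produced = "".join(words)
    if original == produced:
        return True
    # Warn and carry on instead of failing: offsets can no longer be trusted.
    warnings.warn(
        "Tokenizer output does not match the input; character offsets may "
        f"be altered. input: {text!r}, output: {' '.join(words)!r}",
        UserWarning,
    )
    return False


text = "Er geht zum Haus"

# Hand-written word lists standing in for tokenizer output (assumption).
unexpanded = ["Er", "geht", "zum", "Haus"]
expanded = ["Er", "geht", "zu", "dem", "Haus"]  # "zum" -> "zu" + "dem"

print(check_alignment(text, unexpanded))  # True
print(check_alignment(text, expanded))    # False, and emits a UserWarning
```

A check like this would only detect that the text was altered; mapping entity spans back onto the original characters would still need extra logic.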

spaCy version: 3.0.6
spacy-stanza version: 1.2
