Stanza's sentencizer only works when processors = 'tokenize,pos,lemma,depparse' #57

@namiyousef

Hi all,

I started an NLP project that needs high-accuracy sentence segmentation, so I decided to use Stanza.

I was thrilled to find this library, since spaCy is quite intuitive. However, I found that Stanza's sentence segmentation only gets carried over into spaCy under certain conditions.

Baseline:

The baseline test uses the Stanza model alone to check that sentence segmentation works.

This is the simplest pipeline I could use: I only turned on the tokenize processor.

Screenshot 2021-02-03 at 18 57 31
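
The screenshot is not reproduced as text here. As a rough stand-in, a tokenize-only Stanza baseline might look something like the sketch below. The helper names (`build_tokenize_only_pipeline`, `split_sentences`, `demo`) and the sample text are made up for illustration and are not taken from the screenshot.

```python
def build_tokenize_only_pipeline(lang="en"):
    """Build a Stanza pipeline with only the tokenize processor enabled."""
    # Imported lazily so split_sentences() can be used without Stanza installed.
    import stanza

    stanza.download(lang)  # fetches the models on first use
    return stanza.Pipeline(lang=lang, processors="tokenize")


def split_sentences(doc):
    """Return the text of every sentence in a Stanza Document."""
    return [sentence.text for sentence in doc.sentences]


def demo():
    """Hypothetical usage: segment a short sample text and print each sentence."""
    nlp = build_tokenize_only_pipeline()
    doc = nlp("This is one sentence. This is another one.")
    for text in split_sentences(doc):
        print(text)
```

Calling `demo()` needs Stanza installed and its English models available; `split_sentences` only relies on the `sentences` attribute of the returned document.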

Test with spacy-stanza:

I then tried the same thing, but this time through the spacy-stanza wrapper.

Screenshot 2021-02-03 at 18 58 00

As shown above, the text was not actually split into sentences.
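
For readers without the screenshot, here is a minimal sketch of how one might check whether any sentence boundaries reached the spaCy `Doc`. It assumes the spacy-stanza 1.x API (`spacy_stanza.load_pipeline`) on top of spaCy v3; the original report may have used an older interface. The function names and sample text are hypothetical.

```python
def build_wrapped_pipeline(lang="en", processors="tokenize"):
    """Load a Stanza pipeline through the spacy-stanza wrapper (assumed 1.x API)."""
    # Imported lazily so sentence_texts_or_none() works without the wrapper installed.
    import spacy_stanza

    return spacy_stanza.load_pipeline(lang, processors=processors)


def sentence_texts_or_none(doc):
    """Return sentence texts from a spaCy Doc, or None if no boundaries are set."""
    # Iterating doc.sents without boundary annotation can raise in spaCy v3,
    # so check for the SENT_START annotation first.
    if not doc.has_annotation("SENT_START"):
        return None
    return [sent.text for sent in doc.sents]


def demo():
    """Hypothetical usage with only the tokenize processor enabled."""
    nlp = build_wrapped_pipeline(processors="tokenize")
    doc = nlp("This is one sentence. This is another one.")
    print(sentence_texts_or_none(doc))
```

A `None` result would mean that no sentence boundaries were set on the `Doc` at all, which is different from the whole text coming back as one long sentence.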

Test with spacy-stanza with more processors on Stanza:

Screenshot 2021-02-03 at 18 56 23

It seems that the depparse processor is necessary, but this is rather confusing, since the vanilla Stanza model does not require it to segment sentences.
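
To make the comparison easier to reproduce, the two configurations could be run side by side with something like the sketch below. It again assumes the spacy-stanza 1.x `load_pipeline` function; `sentence_counts` and its `load_pipeline` parameter are hypothetical names, and newer Stanza releases may need extra processors (such as `mwt`) in the list.

```python
def sentence_counts(text, processor_sets, load_pipeline=None):
    """Map each processor string to the number of sentences spaCy reports.

    A value of None means no sentence boundaries were set on the Doc.
    """
    if load_pipeline is None:
        # Assumption: spacy-stanza 1.x, imported lazily.
        import spacy_stanza

        load_pipeline = spacy_stanza.load_pipeline

    counts = {}
    for processors in processor_sets:
        nlp = load_pipeline("en", processors=processors)
        doc = nlp(text)
        if doc.has_annotation("SENT_START"):
            counts[processors] = len(list(doc.sents))
        else:
            counts[processors] = None
    return counts


def demo():
    """Hypothetical usage comparing the two setups described in this issue."""
    text = "This is one sentence. This is another one."
    configs = ["tokenize", "tokenize,pos,lemma,depparse"]
    for processors, count in sentence_counts(text, configs).items():
        print(processors, "->", count)
```

If the behaviour described above holds, the two configurations should report different results for the same text.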

Any help would be appreciated :)
