Offset misalignment in NER using the Stanza tokenizer for French

Hi everyone,

I just found a problem when trying to analyze a French sentence. When I run the following code:

```python
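import stanza
from spacy_stanza import StanzaLanguage

# Load the French Stanza pipeline and wrap it as a spaCy language object
snlp = stanza.Pipeline(lang="fr", verbose=False)
stanzanlp = StanzaLanguage(snlp)

text = "C'est l'un des grands messages passés par Bruno Le Maire, ce matin sur RTL."
doc = stanzanlp(text)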
```

I get this warning:

```
/home/victor/miniconda3/envs/nlp/lib/python3.7/site-packages/ipykernel_launcher.py:4: UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ["C'", 'est', "l'", 'un', 'de', 'les', 'grands', 'messages', 'passés', 'par', 'Bruno', 'Le', 'Maire', ',', 'ce', 'matin', 'sur', 'RTL.']
Entities: [('Bruno Le Maire', 'PER', 42, 56), ('RTL.', 'ORG', 71, 75)]
  after removing the cwd from sys.path.
```
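One possible reading of this warning is that the word list no longer matches the raw text character for character: `des` appears as `de` + `les`, so offsets computed from the words drift away from offsets in the original string. The sketch below is only an illustration of that idea, not what spacy-stanza does internally. The helper `first_misaligned_word` is a hypothetical name; it walks the text and reports the first word that is not found verbatim at the current position.

```python
from typing import List, Optional, Tuple


def first_misaligned_word(text: str, words: List[str]) -> Optional[Tuple[int, str, int]]:
    """Return (word index, word, character position) for the first word that
    does not appear verbatim at the current position in `text`, else None.

    Hypothetical helper for illustration only.
    """
    cursor = 0
    for i, word in enumerate(words):
        # Skip whitespace separating tokens in the raw text.
        while cursor < len(text) and text[cursor].isspace():
            cursor += 1
        # The word must start exactly at the cursor to be aligned.
        if not text.startswith(word, cursor):
            return i, word, cursor
        cursor += len(word)
    return None


text = "C'est l'un des grands messages passés par Bruno Le Maire, ce matin sur RTL."

# Word list copied from the warning message above.
stanza_words = ["C'", 'est', "l'", 'un', 'de', 'les', 'grands', 'messages',
                'passés', 'par', 'Bruno', 'Le', 'Maire', ',', 'ce', 'matin',
                'sur', 'RTL.']

mismatch = first_misaligned_word(text, stanza_words)
print(mismatch)  # (5, 'les', 13)
```

Here `de` still matches the first two characters of `des`, but `les` cannot be found at position 13, which is where the words and the raw text stop lining up.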

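Analyzing the same text with the default French model in spaCy, I get almost the same tokens. The differences are that spaCy keeps `des` as a single token and splits the final period from `RTL`, so its entity is `RTL` rather than `RTL.`: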

```python
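import spacy

# Assumption: "default French model" refers to fr_core_news_sm
spacynlp = spacy.load("fr_core_news_sm")
doc = spacynlp(text)

# Print each token with its character offset in the original text
for token in doc:
    print(token.text, token.idx)

# Print the recognized entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)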
```

```
C' 0
est 2
l' 6
un 8
des 11
grands 15
messages 22
passés 31
par 38
Bruno 42
Le 48
Maire 51
, 56
ce 58
matin 61
sur 67
RTL 71
. 74
Bruno Le Maire PER
RTL ORG
```
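To make the differences between the two tokenizations easier to spot, here is a small sketch that diffs the two token lists with the standard library's `difflib`. Both lists are copied by hand from the outputs above; nothing here calls Stanza or spaCy.

```python
from difflib import SequenceMatcher

# Words reported in the Stanza warning.
stanza_words = ["C'", 'est', "l'", 'un', 'de', 'les', 'grands', 'messages',
                'passés', 'par', 'Bruno', 'Le', 'Maire', ',', 'ce', 'matin',
                'sur', 'RTL.']

# Tokens printed by spaCy.
spacy_tokens = ["C'", 'est', "l'", 'un', 'des', 'grands', 'messages',
                'passés', 'par', 'Bruno', 'Le', 'Maire', ',', 'ce', 'matin',
                'sur', 'RTL', '.']

matcher = SequenceMatcher(a=stanza_words, b=spacy_tokens, autojunk=False)

# Keep only the spans where the two token sequences disagree.
differences = [
    (stanza_words[i1:i2], spacy_tokens[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

for stanza_span, spacy_span in differences:
    print(f"Stanza {stanza_span} vs spaCy {spacy_span}")
```

This reports two disagreements: `['de', 'les']` against `['des']`, and `['RTL.']` against `['RTL', '.']`.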

Is anyone else seeing the same issue?
