Hi everyone,
I just found a problem when trying to analyze a French sentence. When I run the following code:
import stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang="fr", verbose=False)
stanzanlp = StanzaLanguage(snlp)
text = "C'est l'un des grands messages passés par Bruno Le Maire, ce matin sur RTL."
doc = stanzanlp(text)
I get this warning:
/home/victor/miniconda3/envs/nlp/lib/python3.7/site-packages/ipykernel_launcher.py:4: UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ["C'", 'est', "l'", 'un', 'de', 'les', 'grands', 'messages', 'passés', 'par', 'Bruno', 'Le', 'Maire', ',', 'ce', 'matin', 'sur', 'RTL.']
Entities: [('Bruno Le Maire', 'PER', 42, 56), ('RTL.', 'ORG', 71, 75)]
after removing the cwd from sys.path.
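Analyzing the same text with the default French model in spaCy, I get almost the same tokens. The difference is the final period: spaCy splits "RTL" and "." into two tokens, while Stanza keeps "RTL." as one (Stanza also expands "des" into "de" and "les").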
doc = spacynlp(text)
for token in doc:
    print(token.text, token.idx)
for ent in doc.ents:
    print(ent.text, ent.label_)
C' 0
est 2
l' 6
un 8
des 11
grands 15
messages 22
passés 31
par 38
Bruno 42
Le 48
Maire 51
, 56
ce 58
matin 61
sur 67
RTL 71
. 74
Bruno Le Maire PER
RTL ORG
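To make the comparison easier to reproduce, here is a minimal sketch that lines up the Stanza word list from the warning against the raw text and checks whether the reported entity offsets fall on word boundaries. The helpers `align_words` and `span_aligns` are hypothetical names written for this post; this is not how spacy-stanza computes its offsets, only a left-to-right string search that I am assuming is a reasonable approximation.

```python
text = "C'est l'un des grands messages passés par Bruno Le Maire, ce matin sur RTL."

# Word list and entities copied from the warning message above.
words = ["C'", "est", "l'", "un", "de", "les", "grands", "messages", "passés",
         "par", "Bruno", "Le", "Maire", ",", "ce", "matin", "sur", "RTL."]
entities = [("Bruno Le Maire", "PER", 42, 56), ("RTL.", "ORG", 71, 75)]


def align_words(text, words):
    """Hypothetical helper: find each word in the text, scanning left to right.

    Returns (word, start_offset) pairs; start_offset is None when the word
    does not occur in the remaining text.
    """
    aligned = []
    pos = 0
    for word in words:
        start = text.find(word, pos)
        if start == -1:
            aligned.append((word, None))
        else:
            aligned.append((word, start))
            pos = start + len(word)
    return aligned


def span_aligns(aligned, start, end):
    """Hypothetical helper: True if start and end both sit on word boundaries."""
    starts = {s for _, s in aligned if s is not None}
    ends = {s + len(w) for w, s in aligned if s is not None}
    return start in starts and end in ends


aligned = align_words(text, words)
unmatched = [w for w, s in aligned if s is None]
print("Words not found in the text:", unmatched)

for ent_text, label, start, end in entities:
    print(ent_text, label, text[start:end] == ent_text,
          span_aligns(aligned, start, end))
```

With this naive search, the only word that cannot be located in the text is "les", which comes from the expansion of "des", and both entity spans do match the raw text at the reported offsets. So the entity offsets themselves look consistent with the input string; the mismatch seems to come from mapping them back onto the word list.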
Is anyone having the same issues?