Replies: 1 comment
Hi @AmitMY, the offsets come from the original WNDB-formatted data and are literally byte offsets into the data files. As such, they are specific to the original language and version of the data. The NLTK and the original OMW data reuse the English synsets for their structure and just add new words, but the WN-LMF XML data from the current OMW, which Wn uses, provides unique identifiers for each lexicon's elements. The form of those identifiers may contain the offsets for historical reasons, but the identifiers are not meant to be decomposed or interpreted.

If you are looking for a way to refer to a concept regardless of the language, you want the ILI (interlingual index). ILIs were created for this purpose.

>>> import wn
>>> en = wn.Wordnet("omw-en")
>>> en.synset("omw-en-00001740-n").ili
ILI('i35545')
>>> wn.synsets(ili="i35545")
[Synset('omw-en-00001740-n'), Synset('omw-sl-00001740-n'), Synset('omw-th-00001740-n'), Synset('omw-hr-00001740-n'), Synset('omw-it-00001740-n'), Synset('omw-sk-00001740-n'), Synset('omw-fr-00001740-n'), Synset('omw-ja-00001740-n'), Synset('omw-nl-00001740-n'), Synset('omw-ro-00001740-n'), Synset('omw-iwn-00001740-n'), Synset('omw-el-00001740-n'), Synset('omw-gl-00001740-n'), Synset('omw-fi-00001740-n'), Synset('omw-ca-00001740-n'), Synset('omw-arb-00001740-n'), Synset('omw-zsm-00001740-n'), Synset('omw-sq-00001740-n'), Synset('omw-eu-00001740-n'), Synset('omw-he-00001740-n'), Synset('omw-id-00001740-n'), Synset('omw-pt-00001740-n'), Synset('omw-lt-00001740-n'), Synset('omw-es-00001740-n'), Synset('omw-cmn-00001740-n')]

With the web API, you can do this instead:

As for your second problem where

>>> from nltk.corpus import wordnet as wn
>>> wn.synset_from_pos_and_offset("v", 672433)
Synset('estimate.v.01')
>>> wn.synset_from_pos_and_offset("v", 672433).lemmas(lang="heb")
[]
>>> wn.synset_from_pos_and_offset("v", 672433).lemmas(lang="fra") # for comparison
[Lemma('estimate.v.01.estimer'), Lemma('estimate.v.01.juger'), Lemma('estimate.v.01.supposer'), Lemma('estimate.v.01.évaluer')]
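
Returning to the ILI lookup earlier in this reply, here is a minimal sketch of how a dataset reference could be resolved without knowing the languages in advance. It is not part of Wn: the helper names `to_omw_en_id` and `synsets_in_all_lexicons` and the ID pattern are hypothetical, and it assumes the dataset's offsets line up with the `omw-en` identifiers. That happens to hold for the example IDs shown here, but as noted above the identifiers are not meant to be decomposed, so treat this as a workaround and not a guarantee.

```python
import re

# Assumed shape of the dataset's references, e.g. "omw.00672433-v" or
# "omw-00672433-v". This pattern is a guess based on the IDs in this thread.
_DATASET_ID = re.compile(r"^omw[.-](\d{8})-([nvars])$")


def to_omw_en_id(dataset_id: str) -> str:
    """Map a dataset reference such as 'omw.00672433-v' to an omw-en synset ID."""
    match = _DATASET_ID.match(dataset_id)
    if match is None:
        raise ValueError(f"unrecognized synset reference: {dataset_id!r}")
    offset, pos = match.groups()
    return f"omw-en-{offset}-{pos}"


def synsets_in_all_lexicons(dataset_id: str):
    """Return the synsets that share an ILI with the English synset."""
    import wn  # imported lazily so the ID helper works without Wn installed

    en = wn.Wordnet("omw-en")
    ili = en.synset(to_omw_en_id(dataset_id)).ili
    if ili is None:
        return []
    # Assumption: depending on the Wn version, .ili may be an object with an
    # .id attribute or a plain string; accept either.
    return wn.synsets(ili=getattr(ili, "id", ili))
```

Calling `synsets_in_all_lexicons("omw.00672433-v")` requires the OMW lexicons to be downloaded first, and it returns one synset per installed lexicon that links to the same ILI, so the caller never has to list languages explicitly.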
I am using a dataset that refers to a synset in a non-conventional way, e.g. omw.00672433-v

Previously, when using nltk, I would:

However, it seems that with the wn package, each synset has an identifier that is language-specific? As in, I can access these two items for "entity":

But not necessarily these two items:

Is there a way, similar to NLTK, to access literally omw-00672433-v without specifying the language? I do not know the supported languages in advance...