WordNet Identifiers without language #274

AmitMY · 2025-07-13T15:07:14Z

I am using a dataset that refers to synsets in an unconventional way, e.g. omw.00672433-v

Previously, with NLTK, I would resolve such an identifier like this:

import nltk

nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("extended_omw")

from nltk.corpus import wordnet as wn

synset_id = "omw.00672433-v"  # example identifier from the dataset
_, omw_id = synset_id.split(".")
omw_offset, omw_pos = omw_id.split("-")
synset = wn.synset_from_pos_and_offset(omw_pos, int(omw_offset))

However, it seems that with the wn package, each synset has a language-specific identifier? As in, I can access these two items for "entity":

But not necessarily these two items:

http://127.0.0.1:8000/lexicons/omw-en:1.4/synsets/omw-en-00672433-v (exists)
http://127.0.0.1:8000/lexicons/omw-he:1.4/synsets/omw-he-00672433-v (does not exist)

Is there a way, similar to NLTK, to access literally omw-00672433-v without specifying the language? I do not know the supported languages in advance...

goodmami · 2025-07-13T19:30:09Z

Maintainer

Hi @AmitMY, the offsets come from the original WNDB-formatted data and are literally byte offsets into the data files. As such, they are very specific to the original language and version of the data. The NLTK and original OMW data reuse the English synsets for their structure and just add new words, but the WN-LMF XML data from the current OMW that Wn uses provides unique identifiers for each lexicon's elements (although the form of those identifiers may contain the offsets for historical reasons, the identifiers are not meant to be decomposed or interpreted).

If you are looking for a way to refer to a concept regardless of the language, you want the ILI (interlingual index). ILIs were created for this purpose.

>>> import wn
>>> en = wn.Wordnet("omw-en")
>>> en.synset("omw-en-00001740-n").ili
ILI('i35545')
>>> wn.synsets(ili="i35545")
[Synset('omw-en-00001740-n'), Synset('omw-sl-00001740-n'), Synset('omw-th-00001740-n'), Synset('omw-hr-00001740-n'), Synset('omw-it-00001740-n'), Synset('omw-sk-00001740-n'), Synset('omw-fr-00001740-n'), Synset('omw-ja-00001740-n'), Synset('omw-nl-00001740-n'), Synset('omw-ro-00001740-n'), Synset('omw-iwn-00001740-n'), Synset('omw-el-00001740-n'), Synset('omw-gl-00001740-n'), Synset('omw-fi-00001740-n'), Synset('omw-ca-00001740-n'), Synset('omw-arb-00001740-n'), Synset('omw-zsm-00001740-n'), Synset('omw-sq-00001740-n'), Synset('omw-eu-00001740-n'), Synset('omw-he-00001740-n'), Synset('omw-id-00001740-n'), Synset('omw-pt-00001740-n'), Synset('omw-lt-00001740-n'), Synset('omw-es-00001740-n'), Synset('omw-cmn-00001740-n')]
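For a dataset identifier such as omw.00672433-v, one possible bridge is to look up the English synset once and then pivot on its ILI. The following is only a sketch: the helper names are hypothetical, and it assumes the omw-en:1.4 lexicon is installed and that its synset IDs happen to have the form omw-en-<offset>-<pos>. As noted above, the identifiers are not meant to be interpreted, so building one from an offset is a convenience that may not hold for other lexicons or versions.

```python
def to_english_synset_id(dataset_id: str, lexicon_id: str = "omw-en") -> str:
    """Map a dataset ID like 'omw.00672433-v' to 'omw-en-00672433-v'.

    Assumption: the English OMW lexicon names its synsets
    '<lexicon_id>-<offset>-<pos>'. This is a historical coincidence,
    not a guarantee of the data format.
    """
    prefix, sep, offset_and_pos = dataset_id.partition(".")
    if not sep or prefix != "omw" or not offset_and_pos:
        raise ValueError(f"unexpected dataset identifier: {dataset_id!r}")
    return f"{lexicon_id}-{offset_and_pos}"


def synsets_for_dataset_id(dataset_id: str, lexicon: str = "omw-en:1.4"):
    """Return the synsets from every installed lexicon that share the ILI
    of the English synset behind `dataset_id` (hypothetical helper)."""
    import wn  # imported lazily so the ID helper works without wn installed

    english = wn.Wordnet(lexicon)
    ili = english.synset(to_english_synset_id(dataset_id)).ili
    if ili is None:  # the English synset has no ILI, so nothing to pivot on
        return []
    return wn.synsets(ili=ili)
```

Calling synsets_for_dataset_id("omw.00672433-v") would then return one synset per installed lexicon that covers the concept, without listing the languages in advance.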

With the web API, you can do this instead:

http://127.0.0.1:8000/synsets?ili=i35545
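From a script, the same request could be made roughly as follows. This is a sketch that assumes the server is running locally on port 8000 as in the URL above and that it responds with JSON; the function names are hypothetical, and the shape of the response is not assumed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def synsets_url(ili_id: str, base_url: str = "http://127.0.0.1:8000") -> str:
    """Build the /synsets query URL for a given ILI identifier."""
    return f"{base_url.rstrip('/')}/synsets?{urlencode({'ili': ili_id})}"


def fetch_synsets(ili_id: str, base_url: str = "http://127.0.0.1:8000"):
    """Query the local web API and return the decoded JSON body as-is."""
    with urlopen(synsets_url(ili_id, base_url)) as response:
        return json.load(response)
```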

As for your second problem, omw-he-00672433-v does not exist because the Hebrew wordnet does not have a synset for that concept. The same is true in NLTK:

>>> from nltk.corpus import wordnet as wn
>>> wn.synset_from_pos_and_offset("v", 672433)
Synset('estimate.v.01')
>>> wn.synset_from_pos_and_offset("v", 672433).lemmas(lang="heb")
[]
>>> wn.synset_from_pos_and_offset("v", 672433).lemmas(lang="fra")  # for comparison
[Lemma('estimate.v.01.estimer'), Lemma('estimate.v.01.juger'), Lemma('estimate.v.01.supposer'), Lemma('estimate.v.01.évaluer')]
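With Wn, a comparable check is to ask which installed lexicons have a synset for a given ILI and collect their languages. The sketch below uses hypothetical helper names and assumes each synset exposes lexicon() returning an object with a language attribute; a language missing from the result simply has no synset for that concept.

```python
from typing import Iterable


def languages_covering(synsets: Iterable) -> list[str]:
    """Return the sorted, de-duplicated language codes of the lexicons
    that the given synsets belong to."""
    return sorted({synset.lexicon().language for synset in synsets})


def concept_languages(ili_id: str) -> list[str]:
    """List the languages that have a synset for `ili_id`
    among the installed lexicons (hypothetical helper)."""
    import wn  # imported lazily; requires the relevant lexicons to be installed

    return languages_covering(wn.synsets(ili=ili_id))
```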
