Replies: 2 comments
-
The "form_of" list inside a "sense" is extracted from the gloss list. An example page for the common case of "form_of": dictionaries
-
Ok, I finally understand what's going on. Note, I'm not claiming there is an "issue" or a "problem"; I don't know enough to make that determination :) I am just observing that most form_of elements are found within a sense dict.
I can see what sets these 4 words apart: they pass different arguments to the header template than other words do. If you compare himself and herself, I think the difference is very clear. "herself" (form_of inside a sense) uses:
en-pron|desc=the third person singular, feminine, personal pronoun|the reflexive form of|she| ...
whereas "himself" (form_of at the top level) uses this in the header template:
en-pron|reflexive form of|he|desc=the third person singular, masculine, personal pronoun|
I don't know how the entries for "reflexive form" are "supposed to be." Wiktionary isn't designed specifically as a machine-readable format, so there's always some fuzziness involved. Still, in the spirit of consistency, I would be motivated to change them on Wiktionary to match the majority of the other entries. I will update "himself" to use the syntax of "herself" for the reflexive form argument in the header and see if that results in the form_of being moved into a sense.
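To make the difference between the two header templates easier to see, here is a minimal sketch that splits each call into positional and named arguments. The helper name `parse_template` is hypothetical, the split on "|" is naive (it would break on nested templates or piped links), and the trailing arguments elided in the comment above are simply left out.

```python
def parse_template(call: str):
    """Naively split a template call into (name, positional args, named args)."""
    name, *args = [part.strip() for part in call.split("|")]
    positional, named = [], {}
    for arg in args:
        if not arg:
            continue  # ignore empty arguments such as a trailing "|"
        key, sep, value = arg.partition("=")
        if sep:
            named[key.strip()] = value.strip()
        else:
            positional.append(arg)
    return name, positional, named


# Template calls as quoted in the comment; elided trailing arguments are omitted.
herself = "en-pron|desc=the third person singular, feminine, personal pronoun|the reflexive form of|she"
himself = "en-pron|reflexive form of|he|desc=the third person singular, masculine, personal pronoun"

for word, call in (("herself", herself), ("himself", himself)):
    name, positional, named = parse_template(call)
    print(word, name, positional, sorted(named))
```

Seen this way, the two calls differ in the wording of the first positional argument ("the reflexive form of" versus "reflexive form of") and in where desc= appears. Which of these the extractor actually reacts to is not something this sketch can tell.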
-
When I look at a subset of the raw wiktextract data for just en and mul, out of 1,437,452 entries there are only 4 that have a "form_of" element at the top level. Every other "form_of" element is nested inside a sense. The 4 words are:
Godself, himself, oneself, xemself. (I'm looking at the 2025-01-31 dated extract.)
When I compare them to a Wiktionary entry like "herself" or "themself", I don't see any obvious reason why they would be processed differently. I don't know if this is a "bug" per se, but since there are only 4 top-level form_of elements, it makes me think they shouldn't exist at the top level. When I do the same search on the full wiktextract file (20GB) of 9,907,333 entries, there are only 2,689 top-level form_of elements. That's only 0.03% of all entries. (For en + mul, the 4 words are only 0.00028% of the 1,437,452 entries.)
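For anyone who wants to repeat the count, here is a rough sketch of the kind of scan described above. It assumes the extract is a JSON Lines file (one entry per line) with "word", "lang_code", "senses" and "form_of" keys; the function name `count_form_of` and the sample entries are made up for illustration and are not real extract data.

```python
import json
from typing import Iterable


def count_form_of(lines: Iterable[str], lang_codes=frozenset({"en", "mul"})):
    """Return (entries scanned, words with top-level form_of, entries with form_of in a sense)."""
    total = 0
    top_level = []
    in_sense = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Keep only the requested languages; pass lang_codes=None to scan everything.
        if lang_codes and entry.get("lang_code") not in lang_codes:
            continue
        total += 1
        if entry.get("form_of"):
            top_level.append(entry.get("word"))
        if any(sense.get("form_of") for sense in entry.get("senses", [])):
            in_sense += 1
    return total, top_level, in_sense


# Illustrative entries only, shaped like the assumed format.
sample_lines = [
    json.dumps({"word": "himself", "lang_code": "en", "form_of": [{"word": "he"}], "senses": [{}]}),
    json.dumps({"word": "herself", "lang_code": "en", "senses": [{"form_of": [{"word": "she"}]}]}),
    json.dumps({"word": "cat", "lang_code": "en", "senses": [{"glosses": ["an animal"]}]}),
    json.dumps({"word": "chats", "lang_code": "fr", "senses": [{"form_of": [{"word": "chat"}]}]}),
]

total, top_level, in_sense = count_form_of(sample_lines)
print(total, top_level, in_sense)
```

On a real extract, the same function can be fed an open file handle directly (a file opened with encoding="utf-8" iterates line by line), so the 20GB file never has to fit in memory.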
I don't have any insight into how languages other than English arrange their information. But for en + mul my gut says something is incorrect, either in how the word entries on Wiktionary are formatted or in the extraction code. In a previous discussion where I found 3 JSON entries that were "orphaned" Thesaurus entries, the problem was on the Wiktionary site. So perhaps that is also the case for these 4 words.
Thanks for maintaining this useful tool!