Top-level "Form_of" element #1040

rob-ross · 2025-02-18T22:31:34Z

When I look at a subset of the raw wiki extract data for just en and mul, out of 1,437,452 entries there are only 4 that have a "form_of" element at the top level. Every other "form_of" element is nested inside a sense. The 4 words are:
Godself, himself, oneself, xemself. (I'm looking at the 2025-01-31 dated extract.)

When I compare to a Wiktionary entry like "herself" or "themself", I don't see any obvious reason why they would be processed differently. I don't know if this is a "bug" per se, but since there are only 4 top-level form_of elements, it makes me think they shouldn't exist at the top level. When I do the same search on the full wiktextract (20GB) file of 9,907,333 words, there are only 2,689 top-level form_of elements. That's only about 0.03% of total entries. (For en + mul, 4 out of 1,437,452 is only about 0.00028%.)
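
For anyone who wants to reproduce this kind of count, here is a minimal sketch. It assumes the extract is a JSON Lines file (one JSON object per line) whose entries carry "word", "lang_code", "senses" and, optionally, "form_of" keys. The function names and the three sample rows are made up for illustration; they are simplified placeholders, not real extract data.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, Optional, Set


def iter_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_form_of(lines: Iterable[str], lang_codes: Optional[Set[str]] = None):
    """Count entries with "form_of" at the top level vs. inside a sense.

    lang_codes, if given, restricts the count to those language codes.
    Returns (counts, words that have a top-level "form_of").
    """
    counts = Counter()
    top_level_words = []
    for entry in iter_entries(lines):
        if lang_codes is not None and entry.get("lang_code") not in lang_codes:
            continue
        counts["entries"] += 1
        if "form_of" in entry:
            counts["top_level"] += 1
            top_level_words.append(entry.get("word"))
        if any("form_of" in sense for sense in entry.get("senses", [])):
            counts["in_sense"] += 1
    return counts, top_level_words


# Hypothetical, heavily simplified rows (assumption: real entries have many more keys).
sample = [
    json.dumps({"word": "himself", "lang_code": "en",
                "form_of": [{"word": "he"}], "senses": [{"glosses": ["..."]}]}),
    json.dumps({"word": "herself", "lang_code": "en",
                "senses": [{"glosses": ["..."], "form_of": [{"word": "she"}]}]}),
    json.dumps({"word": "placeholder", "lang_code": "xx",
                "form_of": [{"word": "other"}], "senses": []}),
]

counts, words = count_form_of(sample, {"en", "mul"})
print(dict(counts), words)
```

On a real extract, the same function can be fed an open file handle instead of the sample list, since it only iterates over lines.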

I don't have any insight into how languages other than English arrange their information. But for en+mul my gut says something is incorrect, either in how the word entries on Wiktionary are formatted or in the extraction code. In a previous discussion where I found 3 JSON entries that were "orphaned" Thesaurus entries, the problem was on the Wiktionary site. So perhaps this is also the case for these 4 words.

Thanks for maintaining this useful tool!

Rob

xxyzz · 2025-02-19T00:29:41Z

Collaborator

The "form_of" lists inside "sense" are extracted from the gloss list (# wikitext), whereas on the four pages you mentioned they are extracted from nodes above the gloss lists. I guess the en edition extractor code can handle the common case of "* of" form-of templates in the gloss list, but couldn't match the text in these four pages.

Example page of common case of "form_of": dictionaries

rob-ross · 2025-03-25T22:38:08Z

Author

OK, I finally understand what's going on. Note, I'm not claiming there is an "issue" or a "problem." I don't know enough to make that determination :) I am just observing that most form_of elements are found within a sense dict. I can see what is different about these 4 words: they use a different header template argument than other words. If you compare himself and herself, I think it's very clear what is different. In "herself" (form_of inside sense):

en-pron|desc=the third person singular, feminine, personal pronoun|the reflexive form of|she| ...

whereas in "himself" (form_of at top-level) it uses:

en-pron|reflexive form of|he|desc=the third person singular, masculine, personal pronoun|

in the header template.
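
To make the comparison concrete, here is a rough sketch that splits the two header strings quoted above into positional and named arguments. The helper name is hypothetical and the split is deliberately naive (it would break on nested templates or links containing "|"); it is not how the extractor parses templates. The "herself" string is cut off before the elided "..." part, since I don't want to guess what follows.

```python
def split_template_args(text):
    """Naively split 'name|arg|key=value|...' into (name, positional, named)."""
    name, *parts = [p.strip() for p in text.split("|")]
    positional, named = [], {}
    for part in parts:
        if not part:
            continue  # skip the empty piece left by a trailing "|"
        key, sep, value = part.partition("=")
        if sep:
            named[key.strip()] = value.strip()
        else:
            positional.append(part)
    return name, positional, named


# Copied from the two snippets above; "herself" is truncated before the "...".
herself = "en-pron|desc=the third person singular, feminine, personal pronoun|the reflexive form of|she"
himself = "en-pron|reflexive form of|he|desc=the third person singular, masculine, personal pronoun|"

for label, text in (("herself", herself), ("himself", himself)):
    name, positional, named = split_template_args(text)
    print(label, name, positional, sorted(named))
```

Under normal MediaWiki template rules, where a named argument such as desc= appears does not change the numbering of the positional ones, so the visible difference in the first positional argument is "the reflexive form of" versus "reflexive form of". Whether that wording is what the extractor keys on is only a guess on my part.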

I don't know how the entries for "reflexive form" are "supposed to be." Since Wiktionary isn't designed specifically as a machine-readable format, there's always some fuzziness involved. I just note that, in the spirit of consistency, I would be motivated to change them on Wiktionary to match the majority of the other entries. I will update "himself" to use the syntax of "herself" regarding the reflexive form argument in the header and see if that results in the form_of being moved into a sense.
