How important is having spaces after key names in the wiktextract file? #1055

rob-ross · 2025-02-28T18:05:50Z

rob-ross
Feb 28, 2025

In the wiktextract file, key names are always followed by a colon then a space, such as:
"word": "A"...
"pos": "character"...

How important is this space in downstream code? I.e., what happens to existing code that uses wiktextract files if the json file had no spaces after the key, such as:
"word":"A"...
"pos":"character"...

I ask because I am looking into using orjson to parse and serialize this JSONL file for performance reasons (they claim a 30% speed improvement), but that library omits the space when writing JSON objects, and there doesn't seem to be a way to keep it without doing a string replace after the dumps() call. (It also omits the space after each comma.) That extra step would eat into the performance gain from using orjson.
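To make the two spacing styles concrete, here is a minimal sketch that uses only the standard-library json module, so it runs without orjson installed. The `separators=(",", ":")` call is a stand-in that reproduces the compact spacing described above; it is not orjson itself, and the sample record is just the one from the post.

```python
import json

record = {"word": "A", "pos": "character"}

# Standard-library default: ", " between items and ": " after each key.
spaced = json.dumps(record)

# Compact form with no optional whitespace (the spacing style described for orjson).
compact = json.dumps(record, separators=(",", ":"))

print(spaced)   # {"word": "A", "pos": "character"}
print(compact)  # {"word":"A","pos":"character"}

# Both strings decode to the same object.
assert json.loads(spaced) == json.loads(compact) == record
```

Any consumer that goes through a JSON parser sees no difference between the two strings.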

To a JSON parser this whitespace is not significant. But I have been using regular expressions as a performance optimization to process lines when I don't need the full JSON structure in memory, so I realize that any existing code with a fragile regex that hard-codes a single space would break.
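As a sketch of the failure mode and one way around it: a pattern that hard-codes the space misses compact lines, while one that allows optional whitespace around the colon matches both. The pattern names and the `first_word` helper are hypothetical, made up for this illustration.

```python
import re

# Hypothetical patterns for illustration; neither comes from wiktextract.
# The simple [^"]* value match does not handle escaped quotes inside strings.
FRAGILE = re.compile(r'"word": "([^"]*)"')        # requires exactly one space
TOLERANT = re.compile(r'"word"\s*:\s*"([^"]*)"')  # allows any optional whitespace

spaced_line = '{"word": "A", "pos": "character"}'
compact_line = '{"word":"A","pos":"character"}'

def first_word(pattern, line):
    """Return the first captured "word" value, or None if nothing matches."""
    match = pattern.search(line)
    return match.group(1) if match else None

print(first_word(FRAGILE, spaced_line))    # A
print(first_word(FRAGILE, compact_line))   # None
print(first_word(TOLERANT, spaced_line))   # A
print(first_word(TOLERANT, compact_line))  # A
```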

I'm writing my own tools for dealing with these wiktextract files, so it's possible I'll open source them at some point. That's why I'm concerned about whether the missing space is going to be an issue. For example, I have a library that can extract certain subsets of objects from the raw extract files, re-arranging keys for ease of debugging and regex processing (putting word, pos, etymology_number, etc., first in the JSON line) and including only parts of the JSON object while omitting others (e.g., translations). But if I use orjson, even if I just deserialize and then write each object to a new file with no changes, the files will no longer be byte-for-byte the same, because all the optional whitespace will have been removed.
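A rough sketch of that kind of tool, assuming the key order and the omitted key named above. `FRONT_KEYS`, `DROP_KEYS`, `reorder` and `same_records` are hypothetical names, not part of wiktextract. The second function compares two JSONL streams by parsed content, which is one way to check that a round trip changed only whitespace or key order.

```python
import json

# Assumed ordering and omitted keys, taken from the examples in the post.
FRONT_KEYS = ("word", "pos", "etymology_number")
DROP_KEYS = {"translations"}

def reorder(entry):
    """Put FRONT_KEYS first, keep the rest in original order, drop DROP_KEYS."""
    front = {k: entry[k] for k in FRONT_KEYS if k in entry}
    rest = {k: v for k, v in entry.items()
            if k not in front and k not in DROP_KEYS}
    return {**front, **rest}

def same_records(lines_a, lines_b):
    """Compare two JSONL streams by parsed content rather than by bytes."""
    return ([json.loads(a) for a in lines_a]
            == [json.loads(b) for b in lines_b])

entry = {"senses": [], "pos": "noun", "translations": [1], "word": "x"}
print(list(reorder(entry)))  # ['word', 'pos', 'senses']

# Different spacing and key order, same content.
print(same_records(['{"a": 1, "b": 2}'], ['{"b":2,"a":1}']))  # True
```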

Advice?

Thanks!

-Rob

kristian-clausal · 2025-04-01T11:26:49Z

kristian-clausal
Apr 1, 2025
Collaborator

Hi, sorry for the late reply, I had no idea we had a discussion section that people actually post in.

I do not know the answer to your question. The stuff generated on Kaikki.org just reads the data as JSON objects, and I don't think anyone else would have much reason to do otherwise. I've found that using regex with JSON is difficult, especially with wiktextract output, because you can have several nested fields with similar field names (word comes to mind), so there's always a ton of false positives.
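A small illustration of that false-positive problem, using a made-up line (the nested value is a placeholder, not real wiktextract data): an unanchored regex picks up the nested "word" as well as the top-level one, while parsing the line returns only the top-level field.

```python
import json
import re

# Invented sample line: a nested "word" appears before the top-level "word".
line = ('{"pos": "noun", '
        '"translations": [{"word": "nested-example"}], '
        '"word": "daffodil"}')

pattern = re.compile(r'"word"\s*:\s*"([^"]*)"')

# The regex cannot tell nesting depth, so it returns both values.
print(pattern.findall(line))     # ['nested-example', 'daffodil']

# A JSON parser returns only the top-level field.
print(json.loads(line)["word"])  # daffodil
```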

1 reply

rob-ross Apr 2, 2025
Author

That's why the first thing I do with a new extract file is reorder the keys and sort the file. E.g.:

{
  "word": "daffydowndilly",
  "pos": "noun",
  "lang": "English",
  "lang_code": "en",
...

This makes it easy to find word/pos/langs. I can also weed out (or include) non-lemmas by searching for form_of, alt_of, etc. tags.
There's quite a lot you can do with just regular expressions, especially statistical analysis and creating subsets of the data for ease of processing (e.g., a file for all en nouns). My current statistics processing works mostly with regular expressions, except for the Top-Level Key counts (file attached).
wiktextract_stats_report.txt
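A sketch of how the reordered layout helps: once word, pos, lang and lang_code are guaranteed to be the first four keys of every line, a pattern anchored at the start of the line can select English nouns without touching nested fields. This assumes one object per line (JSONL) in exactly that key order. `EN_NOUN` and `en_nouns` are hypothetical names, and the "run" line is invented sample data.

```python
import re

# A JSON string literal, including backslash escapes.
STR = r'"(?:[^"\\]|\\.)*"'

# Anchored at line start; optional whitespace is allowed around ':' and ','.
EN_NOUN = re.compile(
    r'^\{\s*"word"\s*:\s*' + STR +
    r'\s*,\s*"pos"\s*:\s*"noun"'
    r'\s*,\s*"lang"\s*:\s*' + STR +
    r'\s*,\s*"lang_code"\s*:\s*"en"'
)

def en_nouns(lines):
    """Keep only lines whose leading keys mark an English noun."""
    return [line for line in lines if EN_NOUN.match(line)]

lines = [
    '{"word": "daffydowndilly", "pos": "noun", "lang": "English", "lang_code": "en"}',
    '{"word":"daffydowndilly","pos":"noun","lang":"English","lang_code":"en"}',
    '{"word": "run", "pos": "verb", "lang": "English", "lang_code": "en"}',
]
print(len(en_nouns(lines)))  # 2
```

Because the pattern tolerates optional whitespace, it matches both the spaced and the compact line, so the same filter works regardless of which serializer wrote the file.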
