-
Hi, sorry for the late reply, I had no idea we had a discussion section that people actually post in. I do not know the answer to your question. The stuff generated on Kaikki.org just reads data as json objects, and I don't think anyone else would have much reason to do otherwise. I've found that trying to use regex with json is difficult, especially wiktextract output because you can have several nested fields with similar field names (…
-
In the wiktextract file, key names are always followed by a colon then a space, such as:
"word": "A"...
"pos": "character"...
How important is this space in downstream code? I.e., what would happen to existing code that uses wiktextract files if the json file had no space after the colon, such as:
"word":"A"...
"pos":"character"...
I ask because I am looking into using orjson to parse and serialize this jsonl file for performance reasons (they claim a 30% speed improvement), but that library omits the space when writing json objects, and there doesn't seem to be a way to keep it without doing a string replace after the dumps() call. (It also omits the space after each comma.) That extra step would eat into the performance gain from using orjson.
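To make the difference concrete, here is a minimal sketch using only the standard library json module. Passing separators=(",", ":") is a stand-in for the compact style described above, not orjson itself, and the sample line is just the two keys from this post:

```python
import json

# One line in the spaced style shown above (sample values from this post).
spaced = '{"word": "A", "pos": "character"}'

obj = json.loads(spaced)

# Compact separators imitate the no-space output described for orjson.
compact = json.dumps(obj, separators=(",", ":"))

# The stdlib default separators are (", ", ": "), i.e. the spaced style.
respaced = json.dumps(obj)

print(compact)
print(respaced)
# A parser sees the same data either way.
print(json.loads(compact) == json.loads(spaced))
```

The two strings differ byte for byte, but they parse to equal objects, so the question only matters for tools that treat the lines as text.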
To a json parser this whitespace is not significant. But I myself have been using regular expressions as a performance optimization to process lines when I don't need the full json structure in memory. So I realize that a fragile regex in existing code that includes a single hard-coded space would break.
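A regex can be made tolerant of either style by allowing optional whitespace around the colon. The following is a rough sketch, not a recommendation for general json handling; WORD_RE and first_word are hypothetical names made up for this illustration:

```python
import re

# \s* accepts zero or more whitespace characters on either side of the colon.
# The value pattern allows backslash-escaped characters inside the string.
WORD_RE = re.compile(r'"word"\s*:\s*"((?:[^"\\]|\\.)*)"')

def first_word(line):
    """Return the raw (still json-escaped) value of the first "word" key, or None."""
    m = WORD_RE.search(line)
    return m.group(1) if m else None

spaced = '{"word": "A", "pos": "character"}'
compact = '{"word":"A","pos":"character"}'

print(first_word(spaced), first_word(compact))
```

Note that this returns the first textual match on the line, so it only gives the top-level value when that key appears before any nested field with the same name.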
I'm writing my own tools for dealing with these wiktextract files, so it's possible I'll open source them at some point. That's why I'm concerned about whether the lack of a space is going to be an issue. For example, I have a library that can extract certain subsets of objects from the raw extract files, re-arranging keys for ease of debugging and regex processing (putting word, pos, etymology_number, etc. first in the json line) and also including only parts of the json object and omitting others (e.g. translations). But if I use orjson, even if I just deserialize and then write each object to a new file with no changes, the files will no longer be the same because all the whitespace will have been removed.
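For illustration, the kind of re-arranging described here could look roughly like the sketch below. It is not the actual library; LEAD_KEYS, DROP_KEYS, reshape and same_data are hypothetical names, and the sample line is made up. It also shows comparing two lines by parsed content rather than byte for byte, which sidesteps the whitespace difference:

```python
import json

# Hypothetical choices for this illustration.
LEAD_KEYS = ("word", "pos", "etymology_number")  # keys to move to the front
DROP_KEYS = {"translations"}                     # top-level keys to omit

def reshape(obj):
    """Return a copy with LEAD_KEYS first and DROP_KEYS removed (top level only)."""
    out = {k: obj[k] for k in LEAD_KEYS if k in obj}
    for k, v in obj.items():
        if k not in out and k not in DROP_KEYS:
            out[k] = v
    return out

def same_data(line_a, line_b):
    """Compare two json lines by parsed content, ignoring whitespace and key order."""
    return json.loads(line_a) == json.loads(line_b)

# Made-up sample line, not taken from a real extract file.
line = '{"senses": [{"glosses": ["letter"]}], "pos": "character", "translations": [], "word": "A"}'

reshaped = json.dumps(reshape(json.loads(line)), separators=(",", ":"))
print(reshaped)
print(same_data('{"word": "A", "pos": "character"}', '{"pos":"character","word":"A"}'))
```

The key order in the output relies on Python dicts preserving insertion order (Python 3.7 and later).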
Advice?
Thanks!
-Rob