Use pysimdjson for parsing wat records #49

silentninja · 2025-02-28T17:32:03Z

Fixes #41

Changed the examples to use pysimdjson for parsing WAT records while avoiding the error reported in TkTech/pysimdjson#122.

sebastian-nagel

Hi @silentninja, thanks for the PR and testing it out.

Because of the observed incompatibilities, it looks like a drop-in use of simdjson isn't recommended.

If we use simdjson.loads(json_blob) or the parse method with recursion (simdjson.Parser().parse(json_blob, True)), the performance gains are mostly lost.

I'm currently running a couple of performance tests and will report back about them on issue #41.
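For anyone who wants to reproduce the comparison locally, here is a rough timing sketch. It is not the benchmark used for issue #41: the function name `benchmark`, the variant labels and the repeat count are made up for illustration, and it assumes pysimdjson is installed (otherwise only `json.loads` is timed).

```python
import json
import timeit


def benchmark(blob, repeat=1000):
    """Time several ways of parsing `blob`; returns {variant: seconds}."""
    variants = {"json.loads": lambda: json.loads(blob)}
    try:
        import simdjson
    except ImportError:
        simdjson = None  # pysimdjson not installed: only time the stdlib
    if simdjson is not None:
        parser = simdjson.Parser()
        variants["simdjson.loads"] = lambda: simdjson.loads(blob)
        # recursive=True converts the whole document to Python objects
        variants["parse(recursive=True)"] = lambda: parser.parse(blob, True)
        # recursive=False returns lazy proxy objects instead
        variants["parse(recursive=False)"] = lambda: parser.parse(blob, False)
    return {name: timeit.timeit(fn, number=repeat)
            for name, fn in variants.items()}


if __name__ == "__main__":
    sample = json.dumps({"headers": {"Server": ["nginx", "Apache"]}}).encode()
    for name, seconds in benchmark(sample).items():
        print(f"{name:28s} {seconds:.4f}s")
```

A tiny sample like this only measures call overhead; real WAT payloads would be needed for meaningful numbers.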

sebastian-nagel · 2025-05-23T12:00:08Z

server_count.py

@@ -21,7 +26,7 @@ def process_record(self, record):

        if self.is_wat_json_record(record):
            # WAT (response) record
-            record = json.loads(self.get_payload_stream(record).read())
+            record = self.json_extractor.parse(self.get_payload_stream(record).read())


Unfortunately, processing the JSON may raise an exception:

  File "/mnt/data/wastl/proj/cc/git/cc-pyspark/sparkcc.py", line 377, in iterate_records
    for res in self.process_record(record):
               ~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "/mnt/data/wastl/proj/cc/git/cc-pyspark/server_count.py", line 42, in process_record
    server_names.append(headers[header].strip())
                        ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'csimdjson.Array' object has no attribute 'strip'

Here is a short snippet showing why this happens:

>>> import simdjson
>>> type(simdjson.Parser().parse('[1,2]'))
<class 'csimdjson.Array'>
>>> type(simdjson.Parser().parse('[1,2]', True))
<class 'list'>

An extra check would be needed:
... or isinstance(headers[header], simdjson.Array)

But then it's no drop-in replacement anymore.
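One way to avoid importing simdjson in every example would be to treat anything that is not a string as a sequence. The sketch below is only an illustration: `header_values` is a hypothetical helper, not code from this PR, and it assumes that `csimdjson.Array` can be iterated like a list.

```python
def header_values(value):
    """Return the header value(s) as a list of stripped strings.

    A single string yields a one-element list; any other value is
    assumed to be an iterable of values (a Python list, or a lazy
    array proxy if the parser was used without recursion).
    """
    if isinstance(value, str):
        return [value.strip()]
    return [str(item).strip() for item in value]


if __name__ == "__main__":
    print(header_values(" nginx "))
    print(header_values(["Apache ", " nginx"]))
```

The calling code still has to change, so this does not make simdjson a drop-in replacement either; it only keeps the type check independent of the parser in use.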

sebastian-nagel · 2025-05-23T12:02:10Z

json_extractor.py

+        try:
+            import simdjson
+            self.json = simdjson.Parser()
+            self.parse = self.json.parse


Could write:

self.parse = lambda j: self.json.parse(j, True)

to force recursive parsing and avoid incompatibilities.

However, then one of the major performance benefits of the simdjson module fades away.
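To make the trade-off concrete, here is a minimal sketch of an extractor that falls back to the standard library and exposes the recursion flag. The class name `JsonExtractor` and the `recursive` constructor argument are assumptions for illustration; the PR's json_extractor.py may be organised differently.

```python
import json


class JsonExtractor:
    """Parse JSON with pysimdjson if available, else with the stdlib."""

    def __init__(self, recursive=True):
        try:
            import simdjson
        except ImportError:
            # pysimdjson not installed: plain json.loads, always native objects
            self.json = json
            self.parse = json.loads
        else:
            self.json = simdjson.Parser()
            # recursive=True converts everything to dict/list/str up front
            # (compatible, slower); recursive=False returns lazy proxies
            self.parse = lambda blob: self.json.parse(blob, recursive)


if __name__ == "__main__":
    extractor = JsonExtractor()
    record = extractor.parse(b'{"Server": ["nginx", "Apache"]}')
    print(type(record["Server"]))
```

With `recursive=True` the result behaves like the output of `json.loads`, which is what the existing examples expect, at the cost described above.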

sebastian-nagel · 2025-05-23T12:02:57Z

wat_extract_links.py

@@ -422,9 +424,9 @@ def yield_links(self, src_url, base_url, links, url_attr, opt_attr=None,
        for l in links:
            if not l:
                continue
-            if url_attr in l:
+            if url_attr is not None and url_attr in l:


Good observation and thanks for testing this. Unfortunately, it's not the only incompatibility.

Use pysimdjson for parsing wat records

14c64f9

silentninja mentioned this pull request Mar 3, 2025

Use simdjson to read WAT payloads #41

Open

sebastian-nagel reviewed May 23, 2025
