Use pysimdjson for parsing wat records #49

Open
wants to merge 1 commit into
base: main

silentninja
Contributor

Fixes #41

Made changes to the examples to use pysimdjson for parsing wat records and avoid causing the error mentioned in TkTech/pysimdjson#122.

Contributor

@sebastian-nagel sebastian-nagel left a comment

Hi @silentninja, thanks for the PR and testing it out.

Because of the observed incompatibilities, it looks like a drop-in use of simdjson isn't recommended.

If we use simdjson.loads(json_blob) or the parse method with recursive conversion (simdjson.Parser().parse(json_blob, True)), the performance gains are mostly lost.

I'm currently running a couple of performance tests and will report back about them on issue #41.

@@ -21,7 +26,7 @@ def process_record(self, record):

         if self.is_wat_json_record(record):
             # WAT (response) record
-            record = json.loads(self.get_payload_stream(record).read())
+            record = self.json_extractor.parse(self.get_payload_stream(record).read())
Contributor

Unfortunately, processing the JSON may raise an exception:

  File "/mnt/data/wastl/proj/cc/git/cc-pyspark/sparkcc.py", line 377, in iterate_records
    for res in self.process_record(record):
               ~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "/mnt/data/wastl/proj/cc/git/cc-pyspark/server_count.py", line 42, in process_record
    server_names.append(headers[header].strip())
                        ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'csimdjson.Array' object has no attribute 'strip'

Here is a short snippet showing why this happens:

>>> import simdjson
>>> type(simdjson.Parser().parse('[1,2]'))
<class 'csimdjson.Array'>
>>> type(simdjson.Parser().parse('[1,2]', True))
<class 'list'>

Contributor

We would need to add an extra check:
... or isinstance(headers[header], simdjson.Array)

But then it's no drop-in replacement anymore.
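
For illustration, a minimal sketch of what such an extra check could look like. The helper name `header_values` and the tuple `SEQUENCE_TYPES` are hypothetical and not part of cc-pyspark; the sketch assumes a header value is either a single string or a sequence of strings, and it falls back to plain lists when pysimdjson is not installed.

```python
try:
    import simdjson
    # Accept both Python lists and simdjson's lazy array proxy.
    SEQUENCE_TYPES = (list, simdjson.Array)
except ImportError:
    # Assumption: without pysimdjson only plain lists can occur.
    SEQUENCE_TYPES = (list,)


def header_values(value):
    """Return the stripped header value(s) as a list of strings.

    Hypothetical helper: `value` is a single string or a sequence of strings.
    """
    if isinstance(value, SEQUENCE_TYPES):
        return [v.strip() for v in value]
    return [value.strip()]


print(header_values(" nginx "))
print(header_values(["Apache ", " nginx"]))
```

Every call site that inspects a parsed value would need a similar guard, which is why the parser stops being a drop-in replacement.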

try:
    import simdjson
    self.json = simdjson.Parser()
    self.parse = self.json.parse
Contributor

Could write:

self.parse = lambda j: self.json.parse(j, True)

to force recursive parsing and avoid incompatibilities.

However, then one of the major performance benefits of the simdjson module fades away.
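
A rough sketch of how the recursive variant could be wrapped, assuming a fallback to the standard `json` module when pysimdjson is not available. The class name `JsonExtractor` and the sample payload are hypothetical and only serve the illustration; they are not taken from the PR.

```python
import json

try:
    import simdjson  # pysimdjson, optional
except ImportError:
    simdjson = None


class JsonExtractor:
    """Hypothetical wrapper that always returns plain Python objects."""

    def __init__(self):
        if simdjson is not None:
            parser = simdjson.Parser()
            # Passing True as the second argument forces recursive conversion
            # to dict/list/str, so callers never see csimdjson proxy objects.
            self.parse = lambda blob: parser.parse(blob, True)
        else:
            # Fallback: the standard library parser, same return types.
            self.parse = json.loads


extractor = JsonExtractor()
record = extractor.parse(b'{"Envelope": {"Headers": {"Server": ["nginx", "nginx"]}}}')
servers = record["Envelope"]["Headers"]["Server"]
print(type(record).__name__, type(servers).__name__)
```

With this shape the calling code sees the same types whichever parser is installed, at the cost of converting the whole document up front.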

@@ -422,9 +424,9 @@ def yield_links(self, src_url, base_url, links, url_attr, opt_attr=None,
         for l in links:
             if not l:
                 continue
-            if url_attr in l:
+            if url_attr is not None and url_attr in l:
Contributor

Good observation and thanks for testing this. Unfortunately, it's not the only incompatibility.

Successfully merging this pull request may close these issues.

Use simdjson to read WAT payloads
2 participants