Use pysimdjson for parsing wat records #49
Hi @silentninja, thanks for the PR and testing it out.
Because of the observed incompatibilities, it looks like a drop-in use of simdjson isn't recommended.
If we use simdjson.loads(json_blob) or the parse method with recursion (simdjson.Parser().parse(json_blob, True)), the performance gains are mostly lost.
I'm currently running a couple of performance tests and will report back about them on issue #41.
@@ -21,7 +26,7 @@ def process_record(self, record):

         if self.is_wat_json_record(record):
             # WAT (response) record
-            record = json.loads(self.get_payload_stream(record).read())
+            record = self.json_extractor.parse(self.get_payload_stream(record).read())
Unfortunately, processing the JSON may raise an exception:
File "/mnt/data/wastl/proj/cc/git/cc-pyspark/sparkcc.py", line 377, in iterate_records
for res in self.process_record(record):
~~~~~~~~~~~~~~~~~~~^^^^^^^^
File "/mnt/data/wastl/proj/cc/git/cc-pyspark/server_count.py", line 42, in process_record
server_names.append(headers[header].strip())
^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'csimdjson.Array' object has no attribute 'strip'
Here is a short snippet showing why this happens:
>>> import simdjson
>>> type(simdjson.Parser().parse('[1,2]'))
<class 'csimdjson.Array'>
>>> type(simdjson.Parser().parse('[1,2]', True))
<class 'list'>
Would need to add an extra check:
... or isinstance(headers[header], simdjson.Array)
But then it's no drop-in replacement anymore.
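The extra check could be folded into a small normalising helper. This is a minimal sketch, assuming a header value arrives as a single string, a Python list, or (when pysimdjson is installed) a simdjson.Array proxy. The name header_values is hypothetical and not part of cc-pyspark.

```python
try:
    import simdjson  # optional dependency (pysimdjson)
    _ARRAY_TYPES = (list, simdjson.Array)
except ImportError:
    # Without pysimdjson only plain Python lists can occur.
    _ARRAY_TYPES = (list,)


def header_values(value):
    """Return the header value(s) as a list of stripped strings.

    Hypothetical helper: accepts a single string, a Python list, or a
    simdjson.Array proxy (the latter only when pysimdjson is importable).
    """
    if isinstance(value, _ARRAY_TYPES):
        return [str(v).strip() for v in value]
    return [str(value).strip()]


print(header_values(" nginx "))        # ['nginx']
print(header_values([" a", "b "]))     # ['a', 'b']
```

Every call site that touches parsed values would need a similar adaptation, which is why this no longer counts as a drop-in replacement.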
try:
    import simdjson
    self.json = simdjson.Parser()
    self.parse = self.json.parse
Could write:
self.parse = lambda j: self.json.parse(j, True)
to force recursive parsing and avoid incompatibilities.
However, then one of the major performance benefits of the simdjson module fades away.
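As a rough illustration of that suggestion, here is a sketch of a wrapper that always returns plain Python objects. The class name JsonExtractor and the backend attribute are assumptions made for this example; the fallback to the standard json module mirrors the try/import pattern in the diff.

```python
import json


class JsonExtractor:
    """Hypothetical wrapper: parse JSON into plain Python objects."""

    def __init__(self):
        try:
            import simdjson  # optional dependency (pysimdjson)
            self._parser = simdjson.Parser()
            # Second argument True = recursive: the whole document is
            # converted to dict/list/str instead of lazy proxy objects.
            self.parse = lambda blob: self._parser.parse(blob, True)
            self.backend = "simdjson"
        except ImportError:
            # Fall back to the standard library parser.
            self.parse = json.loads
            self.backend = "json"


extractor = JsonExtractor()
record = extractor.parse(b'{"Server": ["nginx", "cloudflare"]}')
print(type(record["Server"]).__name__)  # list
```

With either backend the caller sees ordinary dicts and lists, so existing code such as headers[header].strip() keeps working, at the cost of the eager conversion noted above.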
@@ -422,9 +424,9 @@ def yield_links(self, src_url, base_url, links, url_attr, opt_attr=None,
         for l in links:
             if not l:
                 continue
-            if url_attr in l:
+            if url_attr is not None and url_attr in l:
Good observation and thanks for testing this. Unfortunately, it's not the only incompatibility.
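For reference, the guard in this hunk relies on short-circuit evaluation: when url_attr is None, the membership test is never evaluated, so the container is never asked about a None key. The sketch below uses plain dicts and a hypothetical, simplified function links_with_attr. It does not reproduce the pysimdjson behaviour described in TkTech/pysimdjson#122.

```python
def links_with_attr(links, url_attr):
    """Yield link[url_attr] for every non-empty link that has that key.

    Hypothetical, simplified stand-in for the loop in yield_links.
    """
    for link in links:
        if not link:
            continue
        # Short-circuit: `url_attr in link` is skipped when url_attr is None.
        if url_attr is not None and url_attr in link:
            yield link[url_attr]


links = [{"url": "https://example.com/"}, {}, {"path": "IMG@/src"}]
print(list(links_with_attr(links, "url")))   # ['https://example.com/']
print(list(links_with_attr(links, None)))    # []
```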
Fixes #41
Changed the examples to use pysimdjson for parsing WAT records and to avoid the error mentioned in TkTech/pysimdjson#122.