Skip to content

Refactor uproot-raw to chunk the processing #1059

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 28, 2025
Merged

Conversation

ponyisi
Copy link
Collaborator

@ponyisi ponyisi commented May 19, 2025

Sufficiently large uproot-raw requests could allocate amounts of memory too large to fit in the default 1 GB Kubernetes limit (as they built the final results in-memory). We now instead write out the results of each uproot.iterate chunk as they arrive. Take the opportunity to refactor a few aspects of the writing code.

Fixes #1045

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the uproot-raw processing to chunk transformation outputs, reducing memory usage under Kubernetes limits.

  • Updated test expectations to reflect the new processing output.
  • Introduced timing wrappers and incremental processing in the transformation and query translation code.
  • Modified the query translation to yield results on the fly instead of collecting them in dictionaries.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
code_generator_raw_uproot/tests/test_src.py Updated expected hash in the code generation test to match the new output.
code_generator_raw_uproot/servicex/templates/transform_single_file.py Added timing functions and refactored transformation logic for chunked processing along with enhanced error messaging.
code_generator_raw_uproot/servicex/raw_uproot_code_generator/request_translator.py Changed query processing to yield results incrementally and improved variable naming for clarity.
Comments suppressed due to low confidence (1)

code_generator_raw_uproot/servicex/raw_uproot_code_generator/request_translator.py:143

  • [nitpick] Consider renaming 'arrfound' to a more descriptive name such as 'chunk_found' or 'has_found_chunk' for improved code readability.
arrfound = False

@ponyisi
Copy link
Collaborator Author

ponyisi commented May 19, 2025

I should note this is vaguely urgent, Marc has seen workflows that fail without this.

Copy link
Contributor

@BenGalewsky BenGalewsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not qualified to review this code, but I appreciate the idea

@ponyisi ponyisi merged commit d436e73 into develop May 28, 2025
75 checks passed
@ponyisi ponyisi deleted the uproot-raw-iterate branch May 28, 2025 01:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use extend in uproot-raw transformer so we don't have to materialize entire array at once
2 participants