[PLT-999] Vb/chunk by size plt 999 #1648

Merged
vbrodsky merged 5 commits into develop from VB/chunk-by-size_PLT-999 on Jun 4, 2024

@vbrodsky (Contributor) commented Jun 3, 2024

Description

This PR changes the chunking logic for Dataset create_data_rows and upsert_data_rows to chunk by file size rather than by number of data rows. The generated files will be of similar size, which makes processing more predictable and reduces the chance of an error or a performance problem caused by a very large file.

We set the file size limit to 10MB based on the following considerations:

  • in the average case, each row in a file is a hash with row_data and, possibly, a global or external key. We estimate the JSON-encoded size of such a hash at around 500 bytes
  • the current chunk limit for a file is 10K lines, so an average chunk is about 5MB (10,000 × 500 bytes)
  • we have set the chunk size limit for a file to 10MB
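
For illustration, chunking by size rather than by row count could be sketched as follows. This is a minimal sketch and not the code merged in this PR: the names `chunk_by_size` and `MAX_CHUNK_SIZE_BYTES` are hypothetical, "10MB" is assumed to mean 10 * 1024 * 1024 bytes, and chunk size is approximated by summing the JSON-encoded size of each row.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

# Assumption: "10MB" read as 10 MiB. Hypothetical constant name.
MAX_CHUNK_SIZE_BYTES = 10 * 1024 * 1024


def chunk_by_size(
    rows: Iterable[Dict[str, Any]],
    max_bytes: int = MAX_CHUNK_SIZE_BYTES,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of rows whose summed JSON-encoded size is at most max_bytes.

    A single row larger than max_bytes is still emitted, as a chunk of its own.
    The size is an approximation: separators between rows are not counted.
    """
    chunk: List[Dict[str, Any]] = []
    chunk_bytes = 0
    for row in rows:
        row_bytes = len(json.dumps(row).encode("utf-8"))
        # Close the current chunk if adding this row would exceed the limit.
        if chunk and chunk_bytes + row_bytes > max_bytes:
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(row)
        chunk_bytes += row_bytes
    if chunk:
        yield chunk


# Usage with made-up rows and a small limit so that several chunks are produced.
rows = [
    {"row_data": f"https://example.com/{i}.jpg", "global_key": f"key-{i}"}
    for i in range(1000)
]
chunks = list(chunk_by_size(rows, max_bytes=10_000))
```

With a byte budget, a batch of unusually large rows is split into more chunks instead of producing one oversized file, which is the predictability the description is after.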

Fixes # (issue)

Type of change

Please delete options that are not relevant.

  • Internal optimization

All Submissions

  • Have you followed the guidelines in our Contributing document?
  • Have you provided a description?
  • Are your changes properly formatted?

New Feature Submissions

  • Does your submission pass tests?
  • Have you added thorough tests for your new feature?
  • Have you commented your code, particularly in hard-to-understand areas?
  • Have you added a Docstring?

Changes to Core Features

  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?
  • Have you updated any code comments, as applicable?

@vbrodsky vbrodsky requested a review from a team as a code owner June 3, 2024 20:35
@vbrodsky vbrodsky marked this pull request as draft June 3, 2024 20:35
@vbrodsky vbrodsky force-pushed the VB/chunk-by-size_PLT-999 branch 2 times, most recently from 8c70057 to a7514ce Compare June 3, 2024 22:07
…ws_sync, remove unused MAX_DATAROW_PER_API_OPERATION
@vbrodsky vbrodsky force-pushed the VB/chunk-by-size_PLT-999 branch from a7514ce to ac76e38 Compare June 3, 2024 22:09
@vbrodsky vbrodsky marked this pull request as ready for review June 3, 2024 22:32
@vbrodsky vbrodsky force-pushed the VB/chunk-by-size_PLT-999 branch from 2aa342b to 6ea8239 Compare June 4, 2024 02:24
return UploadManifest(source="SDK",
                      item_count=len(specs),
                      chunk_uris=chunk_uris)
return UploadManifest(source="SDK",
Contributor

Sorry I missed this in both this and the previous review, but can the source value ("SDK") be an enum?
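
One way the suggestion could look, as a sketch only: the enum name `UploadSource` is hypothetical, and the thread does not show whether or how the change was made. Mixing in `str` keeps the member equal to the plain string "SDK", so existing comparisons would keep working.

```python
from enum import Enum


class UploadSource(str, Enum):
    """Hypothetical enum for the manifest source, replacing the bare "SDK" literal."""

    SDK = "SDK"


# The member can stand in for the string literal; .value gives the raw string.
source = UploadSource.SDK
raw_source = source.value
```

The call site would then pass `UploadSource.SDK` (or its `.value`) instead of the literal, so a misspelled source fails at attribute lookup rather than being sent as-is.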

@vbrodsky vbrodsky force-pushed the VB/chunk-by-size_PLT-999 branch from 6ea8239 to 3bcdc23 Compare June 4, 2024 16:26
@vbrodsky vbrodsky merged commit 7c985db into develop Jun 4, 2024
22 checks passed
@vbrodsky vbrodsky deleted the VB/chunk-by-size_PLT-999 branch June 4, 2024 17:34
3 participants