-
Notifications
You must be signed in to change notification settings - Fork 3.8k
GH-47030: [C++][Parquet] Add setting to limit the number of rows written per page #47090
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Please check if this is the right direction. @pitrou @mapleFU @adamreeve BTW, some existing test cases will break if I switch the default value to limit 20,000 rows per page. I'm not sure if it is a good idea to use 20,000 as the default value to surprise the downstream. |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach looks correct to me thanks @wgtmac.
I'm not sure if it is a good idea to use 20,000 as the default value to surprise the downstream.
A default of 100k would still change behaviour though, and most of the time result in smaller pages being written. I think it probably makes sense to use 20k to align with Java and Rust, but it could be done as a separate PR if there are a lot of test changes needed.
I don't imagine this should break any downstream code, but we'd definitely want to call it out in the release notes as something for users to be aware of.
I should also mention that #47032 touches the same part of the code. It looks like the fix from that PR can easily be ported to this new code though. |
Should this PR be set to draft until it's ready? |
I think this is ready for review. @pitrou |
Thanks. I'm on vacation so I'm going to be a bit slow, sorry! |
IIUC, this PR may slightly affect CDC. Let me know if you have any feedback. @kszucs |
It depends on the value of the a) size limits are bigger than the cdc limits, then the pages are cut earlier than the size limits would happen: So in theory it shouldn't affect the CDC's effectiveness. We can also check this before merging using https://github.com/huggingface/dataset-dedupe-estimator |
@@ -155,6 +155,7 @@ class PARQUET_EXPORT ReaderProperties { | |||
ReaderProperties PARQUET_EXPORT default_reader_properties(); | |||
|
|||
static constexpr int64_t kDefaultDataPageSize = 1024 * 1024; | |||
static constexpr int64_t kDefaultMaxRowsPerPage = 20'000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should have this feature as opt-in? Otherwise should we choose a bigger value to have the data page size triggered first?
Given the parquet type sizes we could end up much smaller data pages (even before encoding) than 1MB which could be unexpected to the users and also increase the overall metadata size:
- BOOLEAN: 1 bit boolean
- INT32: 32 bit signed ints
- INT64: 64 bit signed ints
- INT96: 96 bit signed ints (deprecated; only used by legacy implementations)
- FLOAT: IEEE 32-bit floating point values
- DOUBLE: IEEE 64-bit floating point values
- BYTE_ARRAY: arbitrarily long byte arrays
- FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, defaulting to 20000 will definitely create smaller pages of numeric types. 20000 is also used by parquet-java and arrow-rs and considered to be a good value per the discussion at https://lists.apache.org/thread/vsxmbvnx9gy5414cfo25mnwcj17h1xyp
Rationale for this change
Currently only page size is limited. We need to limit number of rows per page too.
What changes are included in this PR?
Add
parquet::WriterProperties::max_rows_per_page(int64_t max_rows)
to limit number of rows per data page.Are these changes tested?
TODO
Are there any user-facing changes?
Yes, users are allowed to set this config value.