Byte Sized Batches #3729
westonpace
started this conversation in
Lance File Format
Batch sizes are a common problem. When batch sizes are too large, users get offset overflow panics at worst and, at best, use more RAM and see worse performance. When batch sizes are too small, performance also suffers.
The root cause of this problem is that the ideal batch size should be specified in bytes instead of rows, typically targeting the L2 / L3 cache size. However, batch sizes can only be specified in rows. As a result, the batch size has to be tuned per dataset (or, when columns are projected, per query). This adds a lot of complexity for the user.
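To illustrate the tuning burden (a hypothetical helper, not part of Lance): for a fixed byte budget, the "right" row count depends entirely on the projected row width, so any row-based setting has to change whenever the schema or projection changes.

```rust
// Hypothetical illustration (not a Lance API): the row count that fits a fixed
// byte budget depends entirely on the projected row width, which is why a
// row-based batch size must be re-tuned per dataset / projection.
fn rows_for_budget(budget_bytes: u64, avg_row_width_bytes: u64) -> u64 {
    (budget_bytes / avg_row_width_bytes).max(1)
}

fn main() {
    let budget = 8 * 1024 * 1024; // e.g. target an ~8 MiB, cache-sized batch
    // A narrow projection (two int64 columns = 16 bytes/row) vs. a wide one
    // (a 2 KiB embedding per row) need wildly different row counts.
    println!("narrow: {} rows", rows_for_budget(budget, 16)); // 524288 rows
    println!("wide:   {} rows", rows_for_budget(budget, 2048)); // 4096 rows
}
```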
We should change the file reader so that it can accept a batch_size_bytes parameter. We are not far from being able to support this. The I/O scheduling and decoding are already separate. This parameter should have no effect on the I/O; it should only influence the decoder. The drain method will need to be able to "drain up to X bytes". The only real complexity will be handling the case where there are multiple columns and some of them have variable length values. We could potentially do a binary search to find the correct breaking point, or we could make a best effort estimate and fall back to something more complicated if our estimate is far off (a rough sketch follows below).
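As a sketch only (the names here are hypothetical, not the actual Lance decoder API), one possible shape of the estimate-then-search approach for a column with variable-length values:

```rust
// Sketch of "drain up to X bytes": pick the largest row count whose decoded
// size fits a byte budget, using a cheap average-based estimate first and a
// binary search over the variable-length offsets only if the estimate overshoots.
struct ColumnStats {
    // Bytes per row contributed by fixed-width columns.
    fixed_width_bytes: u64,
    // Cumulative byte offsets for a variable-length column, if any
    // (offsets[i] = total bytes used by the first i rows, offsets[0] = 0).
    variable_offsets: Option<Vec<u64>>,
}

/// Largest row count <= max_rows whose decoded size fits in budget_bytes.
/// Always returns at least 1 row so the drain makes progress.
fn rows_within_budget(col: &ColumnStats, max_rows: u64, budget_bytes: u64) -> u64 {
    debug_assert!(max_rows > 0);
    let bytes_for = |rows: u64| -> u64 {
        let variable = col
            .variable_offsets
            .as_ref()
            .map(|offs| offs[rows as usize])
            .unwrap_or(0);
        col.fixed_width_bytes * rows + variable
    };

    // Best-effort estimate based on the average bytes/row across the batch.
    let avg_per_row = (bytes_for(max_rows) / max_rows).max(1);
    let guess = (budget_bytes / avg_per_row).min(max_rows).max(1);
    if bytes_for(guess) <= budget_bytes {
        return guess;
    }

    // Estimate overshot (values are skewed); binary search for the true cut.
    let (mut lo, mut hi) = (1, guess);
    while lo < hi {
        let mid = (lo + hi + 1) / 2;
        if bytes_for(mid) <= budget_bytes {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    lo
}

fn main() {
    // Two fixed 8-byte columns plus a variable-length column with skewed sizes.
    let col = ColumnStats {
        fixed_width_bytes: 16,
        variable_offsets: Some((0..=1000u64).map(|i| i * i).collect()),
    };
    println!("drain {} rows", rows_within_budget(&col, 1000, 64 * 1024));
}
```

A symmetric check could also grow the guess when it undershoots badly; the point is only that the per-drain decision is cheap and stays entirely inside the decoder, leaving the I/O scheduling untouched.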
I'm putting this up as a design proposal because it may have some impact on structural decoders, although no one is working on structural decoders that I'm aware of, so it shouldn't be too disruptive.