Byte Sized Batches #3729
westonpace
started this conversation in
Lance File Format
Batch sizes are a common problem. When batch sizes are too large, users get offset overflow panics at worst and, at best, use more RAM and see worse performance. When batch sizes are too small, performance also suffers.
The root cause of this problem is that the ideal batch size should be specified in bytes instead of rows, typically targeting the L2 / L3 cache size. However, batch sizes can only be specified in rows. As a result, the batch size has to be tuned per dataset (or, when columns are projected, per query). This adds a lot of complexity for the user.
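To illustrate the tuning burden (a hypothetical helper, not part of Lance): for a fixed byte budget, the "right" row count depends entirely on the projected row width, so any row-based setting has to change whenever the schema or projection changes.

```rust
// Hypothetical illustration (not a Lance API): the row count that fits a fixed
// byte budget depends entirely on the projected row width, which is why a
// row-based batch size must be re-tuned per dataset / projection.
fn rows_for_budget(budget_bytes: u64, avg_row_width_bytes: u64) -> u64 {
    (budget_bytes / avg_row_width_bytes).max(1)
}

fn main() {
    let budget = 8 * 1024 * 1024; // e.g. target an ~8 MiB, cache-sized batch
    // A narrow projection (two int64 columns = 16 bytes/row) vs. a wide one
    // (a 2 KiB embedding per row) need wildly different row counts.
    println!("narrow: {} rows", rows_for_budget(budget, 16)); // 524288 rows
    println!("wide:   {} rows", rows_for_budget(budget, 2048)); // 4096 rows
}
```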
We should change the file reader so that it can accept a batch_size_bytes parameter. We are not far from being able to support this. The I/O scheduling and decoding are already separate. This parameter should have no effect on the I/O; it should only influence the decoder. The drain method will need to be able to "drain up to X bytes". The only real complexity will be handling the case where there are multiple columns and some of them have variable length values. We could potentially do a binary search to find the correct breaking point, or we could make a best effort estimate and fall back to something more complicated if our estimate is far off (a rough sketch follows below).
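As a sketch only (the names here are hypothetical, not the actual Lance decoder API), one possible shape of the estimate-then-search approach for a column with variable-length values:

```rust
// Sketch of "drain up to X bytes": pick the largest row count whose decoded
// size fits a byte budget, using a cheap average-based estimate first and a
// binary search over the variable-length offsets only if the estimate overshoots.
struct ColumnStats {
    // Bytes per row contributed by fixed-width columns.
    fixed_width_bytes: u64,
    // Cumulative byte offsets for a variable-length column, if any
    // (offsets[i] = total bytes used by the first i rows, offsets[0] = 0).
    variable_offsets: Option<Vec<u64>>,
}

/// Largest row count <= max_rows whose decoded size fits in budget_bytes.
/// Always returns at least 1 row so the drain makes progress.
fn rows_within_budget(col: &ColumnStats, max_rows: u64, budget_bytes: u64) -> u64 {
    debug_assert!(max_rows > 0);
    let bytes_for = |rows: u64| -> u64 {
        let variable = col
            .variable_offsets
            .as_ref()
            .map(|offs| offs[rows as usize])
            .unwrap_or(0);
        col.fixed_width_bytes * rows + variable
    };

    // Best-effort estimate based on the average bytes/row across the batch.
    let avg_per_row = (bytes_for(max_rows) / max_rows).max(1);
    let guess = (budget_bytes / avg_per_row).min(max_rows).max(1);
    if bytes_for(guess) <= budget_bytes {
        return guess;
    }

    // Estimate overshot (values are skewed); binary search for the true cut.
    let (mut lo, mut hi) = (1, guess);
    while lo < hi {
        let mid = (lo + hi + 1) / 2;
        if bytes_for(mid) <= budget_bytes {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    lo
}

fn main() {
    // Two fixed 8-byte columns plus a variable-length column with skewed sizes.
    let col = ColumnStats {
        fixed_width_bytes: 16,
        variable_offsets: Some((0..=1000u64).map(|i| i * i).collect()),
    };
    println!("drain {} rows", rows_within_budget(&col, 1000, 64 * 1024));
}
```

A symmetric check could also grow the guess when it undershoots badly; the point is only that the per-drain decision is cheap and stays entirely inside the decoder, leaving the I/O scheduling untouched.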
I'm putting this up as a design proposal because it may have some impact on structural decoders, although no one is working on structural decoders that I'm aware of, so it shouldn't be too disruptive.