Scale Fragments beyond a flat list #4000
jackye1995
started this conversation in
Ideas
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Recently there has been quite some asks during talks that "is it truly scalable for Lance to store all the fragments as a flat list"? This mainly comes from people using Iceberg since Iceberg's 2 level manifest layout is designed with the goal to be more scalable.
I think Delta has already proven that a flat list is sufficient especially when the table is constantly being comapacted anyway, however we are storing things in a protobuf file rather than Parquet checkpoint like Delta, so there might be some performance and memory concerns if we have a lot of fragments, for example when doing a lot of concurrent writes.
Overall, there is no data at this moment to know what is the breaking point that the slowness is becoming not tolerable or the fragment list is taking too much memory. Maybe the first step is to get that data point.
And suppose we have a need to scale up, we can always introduce some more complex data structure to store fragments instead of a flat list. One big issue with Iceberg 2 level manifest is that it can get skew, so a self-balancing b-tree might be better. If we want to be more fancy, we can use for example a b-epsilon tree to make sure we can always access recently added fragments more efficiently.
So far I think it is fine, but just create this discussion here in case anyone is interested to explore further in this domain.
Beta Was this translation helpful? Give feedback.
All reactions