Column group with horizontal compaction and split #3995
Replies: 3 comments 1 reply
-
That's a great idea. It is quite similar to the idea of a column family (as far as I know, some companies implement the same ability to split files using this concept on top of Parquet).
-
Some thoughts on the implementation plan: I think we probably want to do horizontal compaction first, and then column groups can be built on top as syntactic sugar.
Horizontal compaction with explicit input
For example,
where the compactColumnOptions takes a set of columns, and the function goes through all the fragments and merges the related data files into the same file. We can start simple by assuming that all the provided columns are independent, i.e. we only need to merge files and never need to split them. So for example, if a fragment has
And then there are 2 extension cases:
The existing logic in https://github.com/lancedb/lance/blob/main/rust/lance/src/dataset/optimize.rs might be a good starting point for seeing how vertical compaction works today. We likely also need to introduce a new type of operation in transactions. Currently we have Merge
we should create a new operation that represents replacing column data files, without the need to create new fragments. See https://github.com/lancedb/lance/blob/main/rust/lance/src/dataset/transaction.rs for more details.
Allow recording column groups and use them to optimize columns
Now the columns to replace basically become optional in OptimizeColumnOptions, and if they are not provided, it will just optimize each fragment based on its column group. Side note: maybe we should call it column family, since that is a known concept in systems like HBase and is easier for users to understand.
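To make the merge-only case concrete, here is a rough planning sketch in Python rather than Rust. It is not the Lance API: `DataFile`, `Fragment`, `ReplaceColumnFiles`, and `compact_columns` are hypothetical names made up for illustration, and the code only decides which files would be merged without reading or writing any data. It assumes the requested columns are independent, so a file that would need splitting is rejected.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class DataFile:
    """Hypothetical model of one data file and the columns it stores."""
    path: str
    columns: FrozenSet[str]


@dataclass
class Fragment:
    """Hypothetical model of a fragment: an id plus its data files."""
    id: int
    files: List[DataFile]


@dataclass
class ReplaceColumnFiles:
    """Hypothetical operation: swap data files inside an existing fragment
    instead of creating a new fragment."""
    fragment_id: int
    removed: List[str]
    added: DataFile


def compact_columns(
    fragments: Iterable[Fragment],
    columns: Optional[Iterable[str]] = None,
    column_groups: Optional[Dict[str, List[str]]] = None,
) -> List[ReplaceColumnFiles]:
    """Plan merge-only horizontal compaction.

    If `columns` is given it is the single target group; otherwise the
    recorded `column_groups` (assumed shape: name -> columns) are used.
    """
    if columns is not None:
        groups = [frozenset(columns)]
    else:
        groups = [frozenset(cols) for cols in (column_groups or {}).values()]

    ops: List[ReplaceColumnFiles] = []
    for frag in fragments:
        for group in groups:
            # Files that hold at least one column of this group.
            touched = [f for f in frag.files if f.columns & group]
            if len(touched) < 2:
                continue  # zero or one file: nothing to merge
            if any(not f.columns <= group for f in touched):
                # Merge-only assumption: splitting is an extension case.
                raise ValueError(
                    f"fragment {frag.id}: a file would need splitting"
                )
            merged_cols = frozenset().union(*(f.columns for f in touched))
            name = "_".join(sorted(merged_cols))
            merged = DataFile(f"frag{frag.id}/{name}.lance", merged_cols)
            ops.append(
                ReplaceColumnFiles(frag.id, [f.path for f in touched], merged)
            )
    return ops
```

Returning a list of replace-style operations instead of new fragments mirrors the point above that fragment identity should stay the same; how such an operation would actually be encoded in a transaction is left open here.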
-
Hey @jackye1995, thanks for raising this! I think the plan makes sense, so I’ll take a stab at this. I’ll keep you all posted with updates.
-
Based on https://discord.com/channels/1030247538198061086/1030247538667827251/1382081825635307761
The Lance format allows storing different column data in different files. Today the file layout is defined organically by the add_column workload: every time the user adds new column data, it is stored as new data files. However, users might want to group column data in a way that is different from how the data was added, due to requirements such as data loading, column size, and access control. Some use cases:
The Lance table format can add a "column group" concept, so that data files within each fragment can be horizontally compacted or split based on how the user defines the column group layout.
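As a rough illustration of what a column group layout would imply for a single fragment, the sketch below compares the current files against the desired groups and reports, per group, which files would be read and whether that involves a merge, a split, or neither. Everything here is an assumption for illustration: `plan_fragment_layout`, the dictionary shapes, and the file and group names are hypothetical and not part of Lance.

```python
from typing import Dict, List


def plan_fragment_layout(
    files: Dict[str, List[str]],
    column_groups: Dict[str, List[str]],
) -> Dict[str, dict]:
    """Plan how one fragment's files map onto the desired column groups.

    files:         data file path -> columns currently stored in it
    column_groups: group name -> columns that should live together
    """
    plan = {}
    for name, cols in column_groups.items():
        wanted = set(cols)
        # Files holding at least one column of this group.
        sources = sorted(p for p, fc in files.items() if wanted & set(fc))
        stored = set()
        for path in sources:
            stored |= set(files[path])
        plan[name] = {
            "read_from": sources,
            # More than one source file means a horizontal compaction.
            "merge": len(sources) > 1,
            # A source file carrying columns outside the group must be split.
            "split": bool(stored - wanted),
        }
    return plan


# Hypothetical fragment: "id" and "text" were written together, while
# "embedding" and "image" were each added later as separate files.
current_files = {
    "f0.lance": ["id", "text"],
    "f1.lance": ["embedding"],
    "f2.lance": ["image"],
}
groups = {
    "meta": ["id"],
    "search": ["text", "embedding"],
    "blob": ["image"],
}
layout = plan_fragment_layout(current_files, groups)
```

In this example the "search" group needs both operations (f0.lance is split and its "text" column is merged with f1.lance), while "blob" already matches its file and needs no rewrite.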