Partitioning (or not) for Lance #4125
Replies: 10 comments 36 replies
-
Definitions and Analysis of the Current Solutions

Partitioning

Partitioning is the technique of dividing a large table into smaller, logical segments—called partitions—based on the values of one or more columns, known as partition keys. Each partition groups together rows that share the same key values. Data lake formats like Hive, Iceberg and Delta support partitioning. With partitioning, the partition column/transform dictates the exact physical layout on disk. Each data file must contain data in one and only one partition, while one partition can contain multiple data files. Partitioning is used in the following ways in query execution:
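To make the pruning role concrete, here is a minimal sketch of identity-partition pruning over hypothetical Hive-style paths; the table name, paths, and values are made up for illustration:

```python
# Hypothetical Hive-style partition pruning: with identity partitioning,
# a filter on the partition key prunes whole directories before any
# data file is opened.
files = [
    "sales/date=2024-01-01/part-0.parquet",
    "sales/date=2024-01-01/part-1.parquet",
    "sales/date=2024-01-02/part-0.parquet",
    "sales/date=2024-01-03/part-0.parquet",
]

def partition_value(path: str, key: str) -> str:
    # Extract the partition value encoded in the directory name.
    for segment in path.split("/"):
        if segment.startswith(key + "="):
            return segment.split("=", 1)[1]
    raise KeyError(key)

def prune(files, key, wanted):
    # Keep only files whose partition value matches the filter.
    return [f for f in files if partition_value(f, key) == wanted]

print(prune(files, "date", "2024-01-01"))
```

Because the partition value is recoverable from metadata alone, no data file is touched during this step.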
Complaints about partitioning:
Sorting

In the context of data lake formats, sorting is the process of ordering rows within a partition or data file based on the values of one or more sort keys. It does not affect partition structure but optimizes data layout within partitions or files. It helps query execution in the following ways:
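As a small illustration of why a sorted layout helps, a range predicate over a sorted file reduces to two binary searches instead of a full scan; the sample data here is made up:

```python
import bisect

# Sketch: rows in a file are sorted by the sort key, so a range
# predicate lo <= key <= hi reduces to two binary searches.
file_keys = [3, 7, 7, 12, 18, 25, 31]  # sort-key column of one file

def range_slice(keys, lo, hi):
    # Return (start, end) row offsets for lo <= key <= hi.
    return bisect.bisect_left(keys, lo), bisect.bisect_right(keys, hi)

start, end = range_slice(file_keys, 7, 18)
print(file_keys[start:end])  # -> [7, 7, 12, 18]
```

The same ordering also tightens per-file min/max statistics, which makes file-level skipping more effective.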
Complaints about sorting:
Z-ordering

Z-ordering is an alternative to sorting within a partition. It is more effective when there are multiple potential sort keys and the filter order is not deterministic. In general it is a more flexible solution, but more complicated to implement.

Liquid Clustering

Liquid clustering is an alternative to partitioning + sorting/z-ordering. The user defines the clustering keys, and the underlying process takes care of clustering data based on them. This is a logical rather than a physical layout definition, so there is no strict relationship between data files and partitions. A data file can contain data for multiple values of the clustering key, but typically the number of values is small. In Delta 3.0+, for example, the clustering mechanism is implemented using a Hilbert curve. Compared to partitioning + sorting/z-ordering, the pros of liquid clustering are:
The cons are:
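For reference, the bit-interleaving at the core of z-ordering (and, with a different space-filling curve, of Delta's Hilbert-based clustering) can be sketched as follows; the key widths and sample rows are arbitrary:

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    # Interleave the bits of two keys so that rows close in both
    # dimensions get close z-values; this is the core of z-ordering.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Sorting rows by z-value clusters them along both keys at once,
# so filters on either key still skip most files.
rows = [(2, 3), (0, 0), (3, 1), (1, 2)]
rows.sort(key=lambda r: interleave_bits(r[0], r[1]))
```

This is why neither key is privileged the way the leading key is in a plain sort order.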
Sharding

In most of the cases above, the read and write paths are coupled: if you define a partition strategy, you must write according to that strategy, and reads are optimized based on it. With liquid clustering, however, the write path can be separated from the read path: liquid clustering only defines how data should eventually be clustered, which has nothing to do with how data is written to the table. With these 2 concepts separated, there is another consideration: sharding, i.e. how the writers should be distributed. As a part of the primary key upsert proposal, we already introduced the concept of a "region". Parallel writers could be assigned to different regions of the table based on the primary key, and some sort of range-based or hash-based strategy can be used to determine the region. This is outside the scope of this discussion; I just put it here for completeness and awareness of this detail.
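The hash-based region assignment mentioned above can be sketched as follows; the region count and key encoding are assumptions for illustration, not part of the proposal:

```python
import hashlib

# Sketch of hash-based region assignment for parallel writers.
# NUM_REGIONS is a hypothetical table-level setting.
NUM_REGIONS = 8

def region_for(primary_key: str) -> int:
    # Stable hash so the same key always routes to the same region,
    # regardless of which writer process computes it.
    digest = hashlib.sha256(primary_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_REGIONS
```

A range-based variant would instead compare the key against region boundary values; the trade-off is hash spreads load evenly while ranges preserve key locality.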
-
What should Lance do?

My current proposal is that we should go directly with liquid clustering in Lance, rather than supporting partitioning. Here are the key reasons:

Vector Index is already partitioned differently

This might be the most important reason that I am personally against adding partitioning. The goal of Lance ultimately is to be optimized for ML/AI workflows, not to try to be good at everything. Today the vector index is already partitioned by IVF. Adding physical partitions means that we now need to have 1 IVFPQ index per partition. The vector search runtime would increase from

Zone Map for Data Skipping

Scalar indexes today (btree, bitmap) are not good at pruning ranges. As a part of Lance file format 2.1, we have discussed moving the zone map to a table-level index. If we add liquid clustering, this combination would greatly improve statistics-based pruning, which is what many OLAP users are looking for. Based on my previous experience benchmarking Iceberg vs Hive, having a 1-level zone map vs a 2-level one (partition level + sub-partition) has a pretty minimal effect on the final pruning performance, mostly because all the stats are loaded in memory and cached anyway, and such pruning is highly parallelizable. We might also want to make sure the zone map is always collected synchronously (it is pretty cheap to do so) rather than computed asynchronously, in order to ensure the scan pruning performance is always under control.

Scalar Index for Join and Merge Efficiency

Some asks for partitioning come from the point of view of JOIN and MERGE efficiency. Today, we already have an optimization which leverages the scalar index to allow faster joins. If we do liquid clustering so that data to be updated typically lands in a small set of files, then combined with the scalar index, the performance should be good enough without the need to introduce a hard partitioning strategy. This is a part where we need more data points.
For storage partitioned join, yes, we can no longer do that. But going back to the point that Lance is optimized for ML/AI workflows, I think this is okay. We also see that Delta has already moved to liquid clustering, so optimization rules for this strategy should eventually land in query engines like Spark to make it run faster.
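The zone-map pruning described above can be sketched as a per-fragment (min, max) check kept entirely in memory; fragment ids and statistics here are made up:

```python
# Sketch of table-level zone-map pruning: one (min, max) entry per
# fragment for the filter column, checked before any fragment is opened.
zone_map = {
    0: (1, 100),    # fragment id -> (min, max) of the filter column
    1: (90, 250),
    2: (300, 400),
}

def fragments_for_range(zone_map, lo, hi):
    # A fragment can be skipped when its [min, max] range does not
    # overlap the query range [lo, hi].
    return [fid for fid, (mn, mx) in zone_map.items() if mx >= lo and mn <= hi]

print(fragments_for_range(zone_map, 120, 280))  # -> [1]
```

With liquid clustering keeping each fragment's value range narrow, this check eliminates most fragments, which is the statistics-based pruning the comment refers to.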
-
I think the main thing that would be challenging with liquid clustering (or what I think liquid clustering is) would be write operations like "replace all values for this cluster value". For example, imagine the user is clustered on Month and they want to replace all values for a given month. If we aren't partitioned and clustered on the same thing then we would probably need to modify all fragments. I don't know how common this type of operation is. Also, for temporal clustering, the data is often going to be naturally partitioned by the cluster key anyway, so we'd probably be fine there (e.g. if we use "capture date" as the clustering key then rows will be ingested in mostly sorted order anyway).
-
I understand that the partition key itself also serves to identify the timeline of the data from the business side. In particular, Hive provides the ability to automatically delete historical partitions; I don't think a sort/cluster key can achieve this.
-
In my view, a "lazy" or metadata-driven approach cannot fully replace the necessity of physical partitioning in certain scenarios. There are clear use cases where physical partitioning remains indispensable — for example, when data from different batches, projects, geographic regions, or user groups should not be co-mingled from a business logic or compliance standpoint. If such logically separable data is instead grouped together using vector-based partitioning mechanisms like IVF (Inverted File Index), it may lead to challenges in data segregation, access control, and governance. This could, in turn, increase ETL complexity and operational overhead, and slow down model training or inference. It's important to distinguish between two fundamentally different strategies for data skipping: one relies on physical partitioning (e.g., organizing data into separate directories or files), while the other leverages metadata-level optimizations such as indexing, column statistics, or clustering techniques. Each has its own trade-offs and appropriate use cases, and they should be considered complementary rather than interchangeable.
-
The conclusion seems to be an expectation that clustering will replace partitioning. However, in our practice, this is not the optimal solution. For extremely large datasets, the cost of clustering is too high, consuming a significant amount of computing resources; we rarely perform clustering on ultra-large datasets, such as a partition containing 20TB of data. Another advantage of partitioning is the ability to manage data lifecycles: we can simply delete an outdated time partition without rewriting any data files.
-
In our scenario, partitioning is the cornerstone of data processing. For very large tables, exceeding 10B rows, if partitioning is not done in advance to physically segment the data, there is basically no way to handle them. Because a large table can't be processed completely in a single Ray/Spark job, and we also need to move data across partitions (due to GPU constraints), pre-partitioning the data and ensuring that partitions are immutable is the simplest solution: we only need to process the partitions one by one. I also agree that the partition transform in Iceberg is too complex: many engines support it incompletely, and the learning threshold for users is very high. So within our company we simply use identity partitioning, and only perform a lightweight data-processing step during ingestion. I think liquid clustering is suitable for smaller tables used for end-user analysis and experiments, as the data volume is usually not large, within 1B rows. In that case, users do not need to worry about partitioning, and the system automatically optimizes the data, making it a very appropriate solution.
-
I've seen use cases where partitioning actually enhances vector search performance rather than degrading it. This happens when:
In these scenarios, traditional partitioning provides clear benefits:
-
I think there are several challenges with no partitioning (which we are facing now).

Performance concern
Data governance

In practice we have encountered a lot of datasets that are continuously ingested; they logically belong to one dataset and share the same schema. The motivation for splitting them into multiple datasets is flexible retrieval for downstream ML, plus performance concerns. However, as @yanghua said, there have been challenges for TTL and other governance issues at the product layer.

Business-model-decoupled read patterns

I think the root problem is that there can be gaps between the business model and the read patterns. For example, a workload (maybe ML, maybe analysis, maybe exploration) wants to read data under a certain condition (like city, day, or sampling by fields) that resides in an original detailed dataset. Partitioning serves as a fundamental solution that fills this gap through physical layout. Of course there are several other ways to accelerate read patterns, but they are not as fundamental in my mind. I think:
-
My idea is that we might not need partitioning very much. Instead, we can achieve the same effect as "partitioning by time" by introducing some new indexes. Here is the draft idea:

Date Index

In the Date Index, we record a list of fragments for each date, allowing us to quickly locate which fragments contain records for a specific date. Additionally, we can add a flag to indicate whether all records in a fragment come from the same date, which will be helpful when deleting by date. Actually, we already have an interface to do this.
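A minimal sketch of the Date Index idea; the structure and names are hypothetical, not an existing Lance API:

```python
from collections import defaultdict

# Hypothetical Date Index: for each date, the fragment ids containing
# rows for it, plus a flag marking fragments whose rows all come from
# that single date (so they can be dropped whole on delete).
date_index = defaultdict(list)  # date -> [(fragment_id, single_date)]
date_index["2024-06-01"] = [(0, True), (1, False)]
date_index["2024-06-02"] = [(1, False), (2, True)]  # fragment 1 spans two dates

def delete_plan(date_index, date):
    # Split fragments into metadata-only drops vs. ones needing rewrite.
    drop = [fid for fid, single in date_index[date] if single]
    rewrite = [fid for fid, single in date_index[date] if not single]
    return drop, rewrite

print(delete_plan(date_index, "2024-06-02"))  # -> ([2], [1])
```

When ingestion is roughly time-ordered, most fragments get the single-date flag, so delete-by-date stays close to the cheap partition-drop path.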
Partitioning is most commonly used for scenarios like scanning and cleaning. With the Date Index, we can quickly locate the list of fragments for a specific date and hand them over to Spark or Ray, so scanning and cleaning might no longer be issues.

Does Index Query Efficiency Decrease as the Dataset Grows?
@majin1102 mentioned an issue with reduced update efficiency. Since it is possible to update a single row in a 10TB dataset within seconds, I suspect scalar indexes are being used here. The scenario might look like this:
I noticed that BTree indexes are hierarchical/nested (please correct me if I'm wrong), so as the index grows larger, adding another layer should suffice. I'm not sure whether performance will degrade significantly.

GroupIndex

Even if performance does degrade significantly, we have other solutions, for example introducing a group index. GroupIndex is a composite index that requires specifying a group column and an index column during creation. For example, you can specify the date column as the group column and the label column as the index column.
I think this can solve the problem of performance degradation caused by overly large index files.
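A minimal sketch of the proposed GroupIndex; all names and structures here are hypothetical:

```python
from collections import defaultdict

class GroupIndex:
    # Hypothetical composite index: one sub-index per value of the group
    # column (e.g. date), each indexing the index column (e.g. label).
    def __init__(self):
        # group value -> {index value -> set of row ids}
        self.groups = defaultdict(lambda: defaultdict(set))

    def insert(self, row_id, group_value, index_value):
        self.groups[group_value][index_value].add(row_id)

    def lookup(self, group_value, index_value):
        # Only the sub-index for one group is touched, so lookup cost
        # stays bounded as the number of groups (dates) grows.
        return self.groups[group_value][index_value]

idx = GroupIndex()
idx.insert(1, "2024-06-01", "cat")
idx.insert(2, "2024-06-01", "dog")
idx.insert(3, "2024-06-02", "cat")
print(idx.lookup("2024-06-01", "cat"))  # -> {1}
```

Each sub-index stays small regardless of total table size, which is the claimed fix for oversized index files.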
-
This topic has come up multiple times, since today Lance does not support partitioning or sort keys. I'm raising this discussion thread so we can centralize all the feedback here.