Partitioning (or not) for Lance #4125
Replies: 10 comments 36 replies
-
Definitions and Analysis of the Current Solutions

Partitioning

Partitioning is the technique of dividing a large table into smaller, logical segments—called partitions—based on the values of one or more columns, known as partition keys. Each partition groups together rows that share the same key values. Data lake formats like Hive, Iceberg and Delta support partitioning. With partitioning, the partition column/transform dictates the exact physical layout on disk. Each data file must contain data in one and only one partition, while one partition can contain multiple data files. Partitioning is used in the following ways in query execution:
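To make the pruning role concrete, here is a minimal sketch of identity-partition pruning over hypothetical Hive-style paths; the table name, paths, and values are made up for illustration:

```python
# Hypothetical Hive-style partition pruning: with identity partitioning,
# a filter on the partition key prunes whole directories before any
# data file is opened.
files = [
    "sales/date=2024-01-01/part-0.parquet",
    "sales/date=2024-01-01/part-1.parquet",
    "sales/date=2024-01-02/part-0.parquet",
    "sales/date=2024-01-03/part-0.parquet",
]

def partition_value(path: str, key: str) -> str:
    # Extract the partition value encoded in the directory name.
    for segment in path.split("/"):
        if segment.startswith(key + "="):
            return segment.split("=", 1)[1]
    raise KeyError(key)

def prune(files, key, wanted):
    # Keep only files whose partition value matches the filter.
    return [f for f in files if partition_value(f, key) == wanted]

print(prune(files, "date", "2024-01-01"))
```

Because the partition value is recoverable from metadata alone, no data file is touched during this step.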
Complaints about partitioning:
Sorting

In the context of data lake formats, sorting is the process of ordering rows within a partition or data file based on the values of one or more sort keys. It does not affect partition structure but optimizes data layout within partitions or files. It helps query execution in the following ways:
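As a small illustration of why a sorted layout helps, a range predicate over a sorted file reduces to two binary searches instead of a full scan; the sample data here is made up:

```python
import bisect

# Sketch: rows in a file are sorted by the sort key, so a range
# predicate lo <= key <= hi reduces to two binary searches.
file_keys = [3, 7, 7, 12, 18, 25, 31]  # sort-key column of one file

def range_slice(keys, lo, hi):
    # Return (start, end) row offsets for lo <= key <= hi.
    return bisect.bisect_left(keys, lo), bisect.bisect_right(keys, hi)

start, end = range_slice(file_keys, 7, 18)
print(file_keys[start:end])  # -> [7, 7, 12, 18]
```

The same ordering also tightens per-file min/max statistics, which makes file-level skipping more effective.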
Complaints about sorting:
Z-ordering

Z-ordering is an alternative to sorting within a partition. It is more effective when there are multiple potential sort keys and the filter order is not deterministic. In general it is a more flexible solution, but more complicated to implement.

Liquid Clustering

Liquid clustering is an alternative to partitioning + sorting/z-ordering. The user defines the clustering keys, and the underlying process takes care of clustering data based on them. This is a logical rather than a physical layout definition, so there is no strict relationship between data files and partitions. A data file can contain data for multiple values of the clustering key, but typically the number of values is small. In Delta 3.0+, for example, the clustering mechanism is implemented using a Hilbert curve. Compared to partitioning + sorting/z-ordering, the pros of liquid clustering are:
The cons are:
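For reference, the bit-interleaving at the core of z-ordering (and, with a different space-filling curve, of Delta's Hilbert-based clustering) can be sketched as follows; the key widths and sample rows are arbitrary:

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    # Interleave the bits of two keys so that rows close in both
    # dimensions get close z-values; this is the core of z-ordering.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Sorting rows by z-value clusters them along both keys at once,
# so filters on either key still skip most files.
rows = [(2, 3), (0, 0), (3, 1), (1, 2)]
rows.sort(key=lambda r: interleave_bits(r[0], r[1]))
```

This is why neither key is privileged the way the leading key is in a plain sort order.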
Sharding

In most of the cases above, the read and write paths are coupled: if you define a partition strategy, you must write according to that strategy, and reads are optimized based on it. With liquid clustering, however, the write path can be separated from the read path: liquid clustering only defines how data should eventually be clustered, which has nothing to do with how data is written to the table. With these 2 concepts separated, there is another consideration: sharding, i.e. how the writers should be distributed. As a part of the primary key upsert proposal, we already introduced the concept of a "region". Parallel writers could be assigned to different regions of the table based on the primary key, and some sort of range-based or hash-based strategy can be used to determine the region. This is outside the scope of this discussion; I just put it here for completeness and awareness of this detail.
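The hash-based region assignment mentioned above can be sketched as follows; the region count and key encoding are assumptions for illustration, not part of the proposal:

```python
import hashlib

# Sketch of hash-based region assignment for parallel writers.
# NUM_REGIONS is a hypothetical table-level setting.
NUM_REGIONS = 8

def region_for(primary_key: str) -> int:
    # Stable hash so the same key always routes to the same region,
    # regardless of which writer process computes it.
    digest = hashlib.sha256(primary_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_REGIONS
```

A range-based variant would instead compare the key against region boundary values; the trade-off is hash spreads load evenly while ranges preserve key locality.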
-
What should Lance do?

My current proposal is that we should go directly with liquid clustering in Lance, rather than supporting partitioning. Here are the key reasons:

Vector Index is already partitioned differently

This might be the most important reason that I am personally against adding partitioning. The goal of Lance ultimately is to be optimized for ML/AI workflows, not to try to be good at everything. Today the vector index is already partitioned by IVF. Adding physical partitions means that we now need to have 1 IVFPQ index per partition. The vector search runtime would increase from

Zone Map for Data Skipping

Scalar indexes today (btree, bitmap) are not good at pruning ranges. As a part of Lance file format 2.1, we have discussed moving the zone map to a table-level index. If we add liquid clustering, this combination would greatly improve statistics-based pruning, which is what many OLAP users are looking for. Based on my previous experience benchmarking Iceberg vs Hive, having a 1-level zone map vs a 2-level one (partition level + sub-partition) has a pretty minimal effect on the final pruning performance, mostly because all the stats are loaded in memory and cached anyway, and such pruning is highly parallelizable. We might also want to make sure the zone map is always collected synchronously (it is pretty cheap to do so) rather than computed asynchronously, in order to ensure the scan pruning performance is always under control.

Scalar Index for Join and Merge Efficiency

Some asks for partitioning come from the point of view of JOIN and MERGE efficiency. Today, we already have an optimization which leverages the scalar index to allow faster joins. If we do liquid clustering so that data to be updated typically lands in a small set of files, then combined with the scalar index, the performance should be good enough without the need to introduce a hard partitioning strategy. This is a part where we need more data points.
For storage partitioned join, yes, we can no longer do that. But going back to the point that Lance is optimized for ML/AI workflows, I think this is okay. We also see that Delta has already moved to liquid clustering, so optimization rules for this strategy should eventually land in query engines like Spark to make it run faster.
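The zone-map pruning described above can be sketched as a per-fragment (min, max) check kept entirely in memory; fragment ids and statistics here are made up:

```python
# Sketch of table-level zone-map pruning: one (min, max) entry per
# fragment for the filter column, checked before any fragment is opened.
zone_map = {
    0: (1, 100),    # fragment id -> (min, max) of the filter column
    1: (90, 250),
    2: (300, 400),
}

def fragments_for_range(zone_map, lo, hi):
    # A fragment can be skipped when its [min, max] range does not
    # overlap the query range [lo, hi].
    return [fid for fid, (mn, mx) in zone_map.items() if mx >= lo and mn <= hi]

print(fragments_for_range(zone_map, 120, 280))  # -> [1]
```

With liquid clustering keeping each fragment's value range narrow, this check eliminates most fragments, which is the statistics-based pruning the comment refers to.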
-
I think the main thing that would be challenging with liquid clustering (or what I think liquid clustering is) would be write operations like "replace all values for this cluster value". For example, imagine the user is clustered on Month and they want to replace all values for a given month. If we aren't partitioned and clustered on the same thing then we would probably need to modify all fragments. I don't know how common this type of operation is. Also, for temporal clustering, the data is often going to be naturally partitioned by the cluster key anyway, so we'd probably be fine there (e.g. if we use "capture date" as the clustering key then rows will be ingested in mostly sorted order anyway).
-
I understand that the partition key itself also serves to identify the timeline of the data from the business side. In particular, Hive provides the ability to automatically delete historical partitions; I don't think a sort/cluster key can achieve this.
-
In my view, a "lazy" or metadata-driven approach cannot fully replace the necessity of physical partitioning in certain scenarios. There are clear use cases where physical partitioning remains indispensable — for example, when data from different batches, projects, geographic regions, or user groups should not be co-mingled from a business logic or compliance standpoint. If such logically separable data is instead grouped together using vector-based partitioning mechanisms like IVF (Inverted File Index), it may lead to challenges in data segregation, access control, and governance. This could, in turn, increase ETL complexity and operational overhead, and slow down model training or inference. It's important to distinguish between two fundamentally different strategies for data skipping: one relies on physical partitioning (e.g., organizing data into separate directories or files), while the other leverages metadata-level optimizations such as indexing, column statistics, or clustering techniques. Each has its own trade-offs and appropriate use cases, and they should be considered complementary rather than interchangeable.
-
The conclusion seems to be an expectation that clustering will replace partitioning. However, in our practice, this is not the optimal solution. For extremely large datasets, the cost of clustering is too high, consuming a significant amount of computing resources; we rarely perform clustering on ultra-large datasets, such as a partition containing 20TB of data. Another advantage of partitioning is the ability to manage data lifecycles: we can simply delete an outdated time partition without rewriting any data files.
-
In our scenario, partitioning is the cornerstone of data processing. For very large tables, exceeding 10B rows, if partitioning is not done in advance to physically segment the data, there is basically no way to handle them. Because a large table can't be processed completely in a single Ray/Spark job, and we also need to move data across partitions (due to GPU constraints), pre-partitioning the data and ensuring that partitions are immutable is the simplest solution: we only need to process the partitions one by one. I also agree that the partition transform in Iceberg is too complex: many engines support it incompletely, and the learning threshold for users is very high. So within our company we simply use identity partitioning, and only perform a lightweight data-processing step during ingestion. I think liquid clustering is suitable for smaller tables used for end-user analysis and experiments, as the data volume is usually not large, within 1B rows. In that case, users do not need to worry about partitioning, and the system automatically optimizes the data, making it a very appropriate solution.
-
I've seen use cases where partitioning actually enhances vector search performance rather than degrading it. This happens when:
In these scenarios, traditional partitioning provides clear benefits:
-
I think there are several challenges with no partitioning (which we are facing now).

Performance concern
Data governance

In practice we have encountered a lot of datasets that are continuously ingested; they logically belong to one dataset and share the same schema. The motivation for splitting them into multiple datasets is flexible retrieval for downstream ML, plus performance concerns. However, as @yanghua said, there have been challenges for TTL and other governance issues at the product layer.

Business-model-decoupled read patterns

I think the root problem is that there can be gaps between the business model and the read patterns. For example, a workload (maybe ML, maybe analysis, maybe exploration) wants to read data under a certain condition (like city, day, or sampling by fields) that resides in an original detailed dataset. Partitioning serves as a fundamental solution that fills this gap through physical layout. Of course there are several other ways to accelerate read patterns, but they are not as fundamental in my mind. I think:
-
My idea is that we might not need partitioning very much. Instead, we can achieve the same effect as "partitioning by time" by introducing some new indexes. Here is the draft idea:

Date Index

In the Date Index, we record a list of fragments for each date, allowing us to quickly locate which fragments contain records for a specific date. Additionally, we can add a flag to indicate whether all records in a fragment come from the same date, which will be helpful when deleting by date. Actually, we already have an interface to do this.
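A minimal sketch of the Date Index idea; the structure and names are hypothetical, not an existing Lance API:

```python
from collections import defaultdict

# Hypothetical Date Index: for each date, the fragment ids containing
# rows for it, plus a flag marking fragments whose rows all come from
# that single date (so they can be dropped whole on delete).
date_index = defaultdict(list)  # date -> [(fragment_id, single_date)]
date_index["2024-06-01"] = [(0, True), (1, False)]
date_index["2024-06-02"] = [(1, False), (2, True)]  # fragment 1 spans two dates

def delete_plan(date_index, date):
    # Split fragments into metadata-only drops vs. ones needing rewrite.
    drop = [fid for fid, single in date_index[date] if single]
    rewrite = [fid for fid, single in date_index[date] if not single]
    return drop, rewrite

print(delete_plan(date_index, "2024-06-02"))  # -> ([2], [1])
```

When ingestion is roughly time-ordered, most fragments get the single-date flag, so delete-by-date stays close to the cheap partition-drop path.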
Partitioning is most commonly used for scenarios like scanning and cleaning. With the Date Index, we can quickly locate the list of fragments for a specific date and hand them over to Spark or Ray, so scanning and cleaning might no longer be issues.

Does Index Query Efficiency Decrease as the Dataset Grows?
@majin1102 mentioned an issue with reduced update efficiency. Since it is possible to update a single row in a 10TB dataset within seconds, I suspect scalar indexes are being used here. The scenario might look like this:
I noticed that BTree indexes are hierarchical/nested (please correct me if I'm wrong), so as the index grows larger, adding another layer should suffice. I'm not sure whether performance will degrade significantly.

GroupIndex

Even if performance does degrade significantly, we have other solutions, for example introducing a group index. GroupIndex is a composite index that requires specifying a group column and an index column during creation. For example, you can specify the date column as the group column and the label column as the index column.
I think this can solve the problem of performance degradation caused by overly large index files.
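A minimal sketch of the proposed GroupIndex; all names and structures here are hypothetical:

```python
from collections import defaultdict

class GroupIndex:
    # Hypothetical composite index: one sub-index per value of the group
    # column (e.g. date), each indexing the index column (e.g. label).
    def __init__(self):
        # group value -> {index value -> set of row ids}
        self.groups = defaultdict(lambda: defaultdict(set))

    def insert(self, row_id, group_value, index_value):
        self.groups[group_value][index_value].add(row_id)

    def lookup(self, group_value, index_value):
        # Only the sub-index for one group is touched, so lookup cost
        # stays bounded as the number of groups (dates) grows.
        return self.groups[group_value][index_value]

idx = GroupIndex()
idx.insert(1, "2024-06-01", "cat")
idx.insert(2, "2024-06-01", "dog")
idx.insert(3, "2024-06-02", "cat")
print(idx.lookup("2024-06-01", "cat"))  # -> {1}
```

Each sub-index stays small regardless of total table size, which is the claimed fix for oversized index files.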
-
This topic has come up multiple times, since today Lance does not support partitioning or sort keys. I'm raising this discussion thread so we can centralize all the feedback here.