Fast Upsert with User-Defined Primary Key #3842
-
MemTable

A MemTable is implemented using local memory, at least in the initial implementation to keep things simple, and it also works well with lancedb as an embedded engine.

Schema

The schema of the MemTable inherits the schema of the Lance table, with 2 additional system columns:

Sequence number

A MemTable has a sequence number that is strictly increasing. Every time a MemTable is flushed, the next one has a higher sequence number. In our implementation, the first MemTable will have sequence number 0, and each new one increments it by 1, giving 1, 2, 3, 4, … This is related to the design of the WAL storage layout.

Indexes

Based on the current list of indexes in the Lance table, the MemTable keeps the same list of in-memory indexes, which are created synchronously as data is upserted to it.

Write to MemTable

Unlike a traditional sorted MemTable, in order to fit Lance's index-based data format on disk, the MemTable uses a linked hash map to store PK → other row values. A write will:
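To make the structure concrete, here is a minimal sketch of such a MemTable in Python, using `OrderedDict` as the linked hash map. The class shape, the `upsert` method, and the merge-on-write behavior are illustrative assumptions, not the actual implementation.

```python
from collections import OrderedDict


class MemTable:
    """In-memory buffer keyed by the user-defined primary key (sketch only)."""

    def __init__(self, sequence_number: int):
        # Strictly increasing across MemTables: 0, 1, 2, ...
        self.sequence_number = sequence_number
        # Linked hash map: primary key -> latest row values.
        self.rows: "OrderedDict[str, dict]" = OrderedDict()

    def upsert(self, pk: str, values: dict) -> None:
        # Merge the new values over any existing row for this PK.
        existing = self.rows.pop(pk, {})
        existing.update(values)
        # Re-inserting moves the key to the end of the linked hash map; whether
        # the real design keeps the original insertion order is an open detail,
        # this is just one choice for the sketch.
        self.rows[pk] = existing
        # The in-memory indexes mirroring the Lance table's indexes would be
        # updated synchronously here (not shown).
```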
-
WAL

A WAL is implemented using cloud storage for high durability. The WAL is append-only; newer rows are appended at the end of the WAL.

Storage Implementation

The WAL will be implemented in object storage.

Transaction Granularity

Because Lance writes come in batches, instead of ensuring durability for every row written, we will only ensure durability per batch. If anything goes wrong while writing any row within a batch, the whole batch is lost. Each batch has a batch ID, kept in memory, that starts at 0 and increments by 1 after each successful batch write.

File Format

Each record batch is written as an independent file. We will just dump it in Arrow exactly as it comes from the writer.

Storage Layout

The WAL starts with a root folder. This can be in the same directory as the table, e.g. in the Lance table directory:

```
wal_root_dir/
...
99999999999998-0.arrow
99999999999998-1.arrow
99999999999998-2.arrow
99999999999999-0.arrow
99999999999999-1.arrow
```

The directory records the WAL files in the binary format. Each file follows the naming scheme shown above: a component derived from the MemTable sequence number, followed by the batch ID. In the Lance format manifest, a new section is added:

```proto
message Manifest {
...
repeated uint64 flushed_memtables = 18;
...
}
```

This records a list of flushed MemTable sequence numbers sorted in descending order. This information is persisted in the manifest to ensure atomic flush of the MemTable.

Write to WAL
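As a rough illustration of the per-batch write path, where each record batch is dumped as an independent Arrow file into the WAL directory, here is a minimal sketch using pyarrow. The function name, the `seq_component` argument, and the use of a local file path instead of object storage are assumptions for illustration only.

```python
import pyarrow as pa
import pyarrow.ipc as ipc


def append_batch_to_wal(wal_root_dir: str, seq_component: str,
                        batch_id: int, batch: pa.RecordBatch) -> str:
    """Write one record batch as an independent Arrow file in the WAL directory.

    `seq_component` stands for whatever string the naming scheme derives from
    the MemTable sequence number; `batch_id` starts at 0 and is incremented
    after each successful batch write.
    """
    path = f"{wal_root_dir}/{seq_component}-{batch_id}.arrow"
    # Dump the batch exactly as it comes from the writer, in Arrow IPC format.
    with pa.OSFile(path, "wb") as sink:
        with ipc.new_file(sink, batch.schema) as writer:
            writer.write_batch(batch)
    return path
```

In a real deployment this would go through an object-store filesystem rather than `pa.OSFile`, but the per-batch durability boundary stays the same: the batch is either fully written as one file or lost as a whole.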
💡 Note that in some cases the writer may not want to replay the WAL, because the previous unflushed write was aborted intentionally. This could be exposed as an optional flag.

Trim WAL

When trimming, all WALs of already flushed MemTables are deleted, except for the one with the latest sequence number, because that one is needed to derive the next MemTable sequence number. The trim process needs to both delete contents in the WAL directory and clean up the corresponding flushed MemTable entries recorded in the manifest.
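Below is a minimal sketch of the trim step, assuming (hypothetically) that the sequence component in each file name is the plain MemTable sequence number; the real naming scheme and the manifest update are not spelled out here.

```python
import os
import re


def trim_wal(wal_root_dir: str, flushed_sequences: list) -> None:
    """Delete WAL files of already-flushed MemTables, keeping the latest one."""
    if not flushed_sequences:
        return
    latest = max(flushed_sequences)
    pattern = re.compile(r"^(\d+)-(\d+)\.arrow$")
    for name in os.listdir(wal_root_dir):
        match = pattern.match(name)
        if match is None:
            continue
        seq = int(match.group(1))
        # Keep the latest flushed sequence so the next MemTable sequence
        # number can still be derived from it.
        if seq in flushed_sequences and seq != latest:
            os.remove(os.path.join(wal_root_dir, name))
    # The corresponding flushed MemTable entries in the manifest would also
    # be cleaned up here (not shown).
```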
-
For simplicity, I will use the term MemWAL when referring to both the MemTable and the WAL as a combined component/feature.

Write to MemWAL

In general, a write process goes through the following steps:
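As a minimal sketch of one possible write path, WAL append before MemTable apply in the usual LSM fashion, here is what the flow could look like; the `memwal` object, its `wal.append_batch` method, and the step ordering are assumptions, not the proposal's exact steps.

```python
def write_batch(memwal, batch_rows: list, pk_column: str) -> None:
    """Apply one batch of upserts through the MemWAL (sketch only)."""
    # 1. Make the whole batch durable in the WAL first (durability is per batch).
    memwal.wal.append_batch(batch_rows)
    # 2. Upsert each row into the MemTable, keyed by the user-defined primary key.
    for row in batch_rows:
        memwal.memtable.upsert(row[pk_column], row)
    # 3. Only now acknowledge the write to the caller.
```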
Partial Upsert

To work with partial upsert, the MemTable requires a notion that a column in a row is partial. This is tracked in the
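Building on the MemTable sketch above, one way to model this is a per-row record of which columns actually carry values, so that a partial upsert only overwrites the columns it provides. Where this is tracked in the real design is not specified here; the per-row set below is purely illustrative.

```python
class PartialAwareMemTable(MemTable):
    """MemTable sketch extended with per-row column presence tracking."""

    def __init__(self, sequence_number: int):
        super().__init__(sequence_number)
        # pk -> set of column names that have real (non-partial) values.
        self.present_columns = {}

    def partial_upsert(self, pk: str, values: dict) -> None:
        existing = self.rows.pop(pk, {})
        existing.update(values)  # merge only the provided columns
        self.rows[pk] = existing
        self.present_columns.setdefault(pk, set()).update(values.keys())
```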
-
Read MemWAL

There are 2 types of readers:

Strongly consistent (SC)

Data written to the MemWAL needs to be immediately visible to the reader. This is achieved by:
Eventually consistent (EC)

Data written to the MemWAL can be invisible to the reader within a given tolerance window. This tolerance window is a property of the MemWAL shared across all EC readers, since it will be the maximum interval between MemTable flushes (more details later). This is achieved by simply reading the data in the Lance dataset on storage, without combining it with results from the MemTable.
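A minimal sketch of the two read modes, with hypothetical names: the SC path combines the MemTable with the on-disk Lance dataset (the MemTable wins for any primary key it contains), while the EC path reads only the on-disk dataset. The `dataset.lookup` call is a placeholder, not a real Lance API.

```python
def read_row(pk, memtable, dataset, consistency: str = "sc"):
    """Point lookup under strongly or eventually consistent semantics (sketch)."""
    if consistency == "sc" and pk in memtable.rows:
        # Unflushed data in the MemTable overrides whatever is on disk.
        return memtable.rows[pk]
    # Placeholder for a primary-key lookup against the Lance dataset on storage.
    return dataset.lookup(pk)
```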
-
Flush MemWAL

Criteria

The MemTable will be flushed when hitting certain criteria, for example:
Workflow

When flushing the MemTable, the following will happen:
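To make the moving parts concrete, here is one plausible shape of the flush, building on the earlier sketches; `merge_insert_rows` and the manifest object are hypothetical stand-ins, and the real workflow steps are not reproduced here.

```python
def flush(memwal, dataset, manifest) -> None:
    """Seal the active MemTable, persist it, then record and trim (sketch)."""
    # Seal the current MemTable and start a new one with the next sequence number.
    sealed = memwal.memtable
    memwal.memtable = MemTable(sealed.sequence_number + 1)
    # Batch-upsert the sealed rows into the on-disk Lance dataset (hypothetical call).
    dataset.merge_insert_rows(list(sealed.rows.values()))
    # Record the flushed sequence number in the manifest (kept in descending order)
    # so the flush is atomic from the reader's point of view.
    manifest.flushed_memtables.insert(0, sealed.sequence_number)
    # Finally, trim the WAL files that are no longer needed.
    trim_wal(memwal.wal_root_dir, manifest.flushed_memtables)
```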
Impact on Read Consistency

Between step 1 and step 3, there are technically 2 MemTables, making the whole system a 3-level LSM tree. An SC reader needs to recursively apply the sealed MemTable and then the new MemTable data before returning the result.

💡 Note that if we do not flush fast enough, it is possible that the second MemTable also fills up and needs to be flushed, so we get more and more queued MemTables and can never catch up. In the initial implementation, if the second MemTable also fills up, we will simply stop the writer completely to keep things simple. More details in the next section on future work.
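For a point lookup, this layered read amounts to checking the newest level first and falling through level by level, as in the sketch below; names are again hypothetical.

```python
def read_row_during_flush(pk, active_memtable, sealed_memtable, dataset):
    """SC point lookup while a sealed MemTable is still being flushed (sketch)."""
    if pk in active_memtable.rows:
        return active_memtable.rows[pk]
    if sealed_memtable is not None and pk in sealed_memtable.rows:
        return sealed_memtable.rows[pk]
    return dataset.lookup(pk)  # hypothetical on-disk lookup
```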
-
Future Work

In this model, the constraints on upsert performance are: single writer, memory, and flush speed.

Single Writer

We will enforce that a single LSM tree must be written by a single writer. To extend to more writers, we will need to add a partitioning concept to Lance so we can have 1 writer per partition and scale horizontally. I guess this will eventually come to Lance anyway, so we can discuss that separately and focus on the single-writer use case here.

Memory

There are some issues with memory management; for example, if we write images and videos, we can easily blow up the MemTable with just a few rows. We will try to solve this in the blob storage discussion and complete the full story of blob datasets and blob files.

Flush speed

Currently a flush runs a batch upsert against the Lance dataset on disk. This might be slow. To flush fast, we should directly flush the MemTable to Lance, and this could be achieved by having features like multi-level fragments with sequential order (something like SSTables in Cassandra or HFiles in HBase). But that would be a much larger scope change. I think if this current proposal works out, then we can explore this direction further.
-
Upsert has always been a pain point for data lakes. I've spent a few years working in this area and I want to share some concerns on this:
Anyway, I'm excited about this upsert idea on Lance, which could make the Lance format a more general data lake solution, and I'm willing to contribute if more details are revealed.
-
Small, fast, frequent upsert (merge-insert + delete) is currently not optimal in Lance, mainly because of 2 issues:
Proposed Solution
Learning from Apache Hudi and Apache Paimon, I think there is a possibility to solve this problem in Lance by introducing some Log-Structured-Merge (LSM) semantics with user-defined primary key, and I would like to discuss this possibility here.
To start simple (or maybe not that simple), we can introduce concepts like primary key, WAL, and MemTable to support LSM-style reader and writer implementations. Here is how this would look:
In short, it turns Lance into a 2-level LSM tree, where the whole on-disk Lance dataset is the second level of the tree, with an additional MemTable and WAL on top of it to enable LSM-style upserts and reads.
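For readers who like to see the shape of the components, here is a tiny structural sketch of that 2-level view; all names are illustrative, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class MemWAL:
    """The extra level on top of the on-disk dataset: MemTable plus WAL location."""
    memtable: object       # in-memory, PK-keyed buffer (see the MemTable sketch)
    wal_root_dir: str      # WAL directory on (object) storage


@dataclass
class LsmLanceTable:
    """2-level LSM view: the whole on-disk Lance dataset is the bottom level."""
    memwal: MemWAL
    dataset: object        # the existing Lance dataset on storage
```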