Fast Upsert with User-Defined Primary Key #3842
-
MemTable

A MemTable is implemented using local memory, at least in the initial implementation to keep things simple, and it also works well with lancedb as an embedded engine.

Schema

The schema of the MemTable inherits the schema of the Lance table, with 2 additional system columns:

Sequence number

A MemTable has a sequence number that is strictly increasing. Every time a MemTable is flushed, the next one has a higher sequence number. In our implementation, the first MemTable will have sequence number 0, and each new one increments it by 1, giving 1, 2, 3, 4, … This is related to the design of the WAL storage layout.

Indexes

Based on the current list of indexes in the Lance table, the MemTable keeps the same list of in-memory indexes, which are created synchronously as data is upserted to it.

Write to MemTable

Unlike a traditional sorted MemTable, in order to fit Lance's index-based data format on disk, the MemTable uses a linked hash map to store PK → other row values. A write will:
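To make the structure concrete, here is a minimal sketch of such a MemTable in Python, using `OrderedDict` as the linked hash map. The class shape, the `upsert` method, and the merge-on-write behavior are illustrative assumptions, not the actual implementation.

```python
from collections import OrderedDict


class MemTable:
    """In-memory buffer keyed by the user-defined primary key (sketch only)."""

    def __init__(self, sequence_number: int):
        # Strictly increasing across MemTables: 0, 1, 2, ...
        self.sequence_number = sequence_number
        # Linked hash map: primary key -> latest row values.
        self.rows: "OrderedDict[str, dict]" = OrderedDict()

    def upsert(self, pk: str, values: dict) -> None:
        # Merge the new values over any existing row for this PK.
        existing = self.rows.pop(pk, {})
        existing.update(values)
        # Re-inserting moves the key to the end of the linked hash map; whether
        # the real design keeps the original insertion order is an open detail,
        # this is just one choice for the sketch.
        self.rows[pk] = existing
        # The in-memory indexes mirroring the Lance table's indexes would be
        # updated synchronously here (not shown).
```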
-
WAL

A WAL is implemented using cloud storage for high durability. The WAL is append-only; newer rows are appended at the end of the WAL.

Storage Implementation

The WAL will be implemented in object storage.

Transaction Granularity

Because Lance writes come in batches, instead of ensuring durability for every row written, we will only ensure durability per batch. If anything goes wrong while writing any row within a batch, the whole batch is lost. Each batch has a batch ID, kept in memory, that starts at 0 and increments by 1 after each successful batch write.

File Format

Each record batch is written as an independent file. We will just dump it in Arrow exactly as it comes from the writer.

Storage Layout

The WAL starts with a root folder. This can be in the same directory as the table, e.g. in the Lance table directory:

```
wal_root_dir/
...
99999999999998-0.arrow
99999999999998-1.arrow
99999999999998-2.arrow
99999999999999-0.arrow
99999999999999-1.arrow
```

The directory records the WAL files in the binary format. Each file follows the naming scheme shown above: a component derived from the MemTable sequence number, followed by the batch ID. In the Lance format manifest, a new section is added:

```proto
message Manifest {
...
repeated uint64 flushed_memtables = 18;
...
}
```

This records a list of flushed MemTable sequence numbers sorted in descending order. This information is persisted in the manifest to ensure atomic flush of the MemTable.

Write to WAL
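As a rough illustration of the per-batch write path, where each record batch is dumped as an independent Arrow file into the WAL directory, here is a minimal sketch using pyarrow. The function name, the `seq_component` argument, and the use of a local file path instead of object storage are assumptions for illustration only.

```python
import pyarrow as pa
import pyarrow.ipc as ipc


def append_batch_to_wal(wal_root_dir: str, seq_component: str,
                        batch_id: int, batch: pa.RecordBatch) -> str:
    """Write one record batch as an independent Arrow file in the WAL directory.

    `seq_component` stands for whatever string the naming scheme derives from
    the MemTable sequence number; `batch_id` starts at 0 and is incremented
    after each successful batch write.
    """
    path = f"{wal_root_dir}/{seq_component}-{batch_id}.arrow"
    # Dump the batch exactly as it comes from the writer, in Arrow IPC format.
    with pa.OSFile(path, "wb") as sink:
        with ipc.new_file(sink, batch.schema) as writer:
            writer.write_batch(batch)
    return path
```

In a real deployment this would go through an object-store filesystem rather than `pa.OSFile`, but the per-batch durability boundary stays the same: the batch is either fully written as one file or lost as a whole.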
💡 Note that in some cases the writer may not want to replay the WAL, because the previous unflushed write was aborted intentionally. This could be exposed as an optional flag.

Trim WAL

When trimming, all WALs of already flushed MemTables are deleted, except for the one with the latest sequence number, because that one is needed to derive the next MemTable sequence number. The trim process needs to both delete contents in the WAL directory and clean up the corresponding flushed MemTable entries recorded in the manifest.
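Below is a minimal sketch of the trim step, assuming (hypothetically) that the sequence component in each file name is the plain MemTable sequence number; the real naming scheme and the manifest update are not spelled out here.

```python
import os
import re


def trim_wal(wal_root_dir: str, flushed_sequences: list) -> None:
    """Delete WAL files of already-flushed MemTables, keeping the latest one."""
    if not flushed_sequences:
        return
    latest = max(flushed_sequences)
    pattern = re.compile(r"^(\d+)-(\d+)\.arrow$")
    for name in os.listdir(wal_root_dir):
        match = pattern.match(name)
        if match is None:
            continue
        seq = int(match.group(1))
        # Keep the latest flushed sequence so the next MemTable sequence
        # number can still be derived from it.
        if seq in flushed_sequences and seq != latest:
            os.remove(os.path.join(wal_root_dir, name))
    # The corresponding flushed MemTable entries in the manifest would also
    # be cleaned up here (not shown).
```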
-
For simplicity, I will use the term MemWAL when referring to both the MemTable and the WAL as a combined component/feature.

Write to MemWAL

In general, a write process goes through the following steps:
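As a minimal sketch of one possible write path, WAL append before MemTable apply in the usual LSM fashion, here is what the flow could look like; the `memwal` object, its `wal.append_batch` method, and the step ordering are assumptions, not the proposal's exact steps.

```python
def write_batch(memwal, batch_rows: list, pk_column: str) -> None:
    """Apply one batch of upserts through the MemWAL (sketch only)."""
    # 1. Make the whole batch durable in the WAL first (durability is per batch).
    memwal.wal.append_batch(batch_rows)
    # 2. Upsert each row into the MemTable, keyed by the user-defined primary key.
    for row in batch_rows:
        memwal.memtable.upsert(row[pk_column], row)
    # 3. Only now acknowledge the write to the caller.
```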
Partial Upsert

To work with partial upsert, the MemTable requires a notion that a column in a row is partial. This is tracked in the
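Building on the MemTable sketch above, one way to model this is a per-row record of which columns actually carry values, so that a partial upsert only overwrites the columns it provides. Where this is tracked in the real design is not specified here; the per-row set below is purely illustrative.

```python
class PartialAwareMemTable(MemTable):
    """MemTable sketch extended with per-row column presence tracking."""

    def __init__(self, sequence_number: int):
        super().__init__(sequence_number)
        # pk -> set of column names that have real (non-partial) values.
        self.present_columns = {}

    def partial_upsert(self, pk: str, values: dict) -> None:
        existing = self.rows.pop(pk, {})
        existing.update(values)  # merge only the provided columns
        self.rows[pk] = existing
        self.present_columns.setdefault(pk, set()).update(values.keys())
```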
-
Read MemWAL

There are 2 types of readers:

Strongly consistent (SC)

Data written to the MemWAL needs to be immediately visible to the reader. This is achieved by:
Eventually consistent (EC)

Data written to the MemWAL can be invisible to the reader within a given tolerance window. This tolerance window is a property of the MemWAL shared across all EC readers, since it will be the maximum interval between MemTable flushes (more details later). This is achieved by simply reading the data in the Lance dataset on storage, without combining it with results from the MemTable.
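A minimal sketch of the two read modes, with hypothetical names: the SC path combines the MemTable with the on-disk Lance dataset (the MemTable wins for any primary key it contains), while the EC path reads only the on-disk dataset. The `dataset.lookup` call is a placeholder, not a real Lance API.

```python
def read_row(pk, memtable, dataset, consistency: str = "sc"):
    """Point lookup under strongly or eventually consistent semantics (sketch)."""
    if consistency == "sc" and pk in memtable.rows:
        # Unflushed data in the MemTable overrides whatever is on disk.
        return memtable.rows[pk]
    # Placeholder for a primary-key lookup against the Lance dataset on storage.
    return dataset.lookup(pk)
```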
-
Flush MemWAL

Criteria

The MemTable will be flushed when hitting certain criteria, for example:
Workflow

When flushing the MemTable, the following will happen:
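To make the moving parts concrete, here is one plausible shape of the flush, building on the earlier sketches; `merge_insert_rows` and the manifest object are hypothetical stand-ins, and the real workflow steps are not reproduced here.

```python
def flush(memwal, dataset, manifest) -> None:
    """Seal the active MemTable, persist it, then record and trim (sketch)."""
    # Seal the current MemTable and start a new one with the next sequence number.
    sealed = memwal.memtable
    memwal.memtable = MemTable(sealed.sequence_number + 1)
    # Batch-upsert the sealed rows into the on-disk Lance dataset (hypothetical call).
    dataset.merge_insert_rows(list(sealed.rows.values()))
    # Record the flushed sequence number in the manifest (kept in descending order)
    # so the flush is atomic from the reader's point of view.
    manifest.flushed_memtables.insert(0, sealed.sequence_number)
    # Finally, trim the WAL files that are no longer needed.
    trim_wal(memwal.wal_root_dir, manifest.flushed_memtables)
```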
Impact on Read Consistency

Between step 1 and step 3, there are technically 2 MemTables, making the whole system a 3-level LSM tree. An SC reader needs to recursively apply the sealed MemTable and then the new MemTable data before returning the result.

💡 Note that if we do not flush fast enough, it is possible that the second MemTable also fills up and needs to be flushed, so we get more and more queued MemTables and can never catch up. In the initial implementation, if the second MemTable also fills up, we will simply stop the writer completely to keep things simple. More details in the next section on future work.
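For a point lookup, this layered read amounts to checking the newest level first and falling through level by level, as in the sketch below; names are again hypothetical.

```python
def read_row_during_flush(pk, active_memtable, sealed_memtable, dataset):
    """SC point lookup while a sealed MemTable is still being flushed (sketch)."""
    if pk in active_memtable.rows:
        return active_memtable.rows[pk]
    if sealed_memtable is not None and pk in sealed_memtable.rows:
        return sealed_memtable.rows[pk]
    return dataset.lookup(pk)  # hypothetical on-disk lookup
```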
-
Future Work

In this model, the constraints on upsert performance are: single writer, memory, and flush speed.

Single Writer

We will enforce that a single LSM tree must be written by a single writer. To extend to more writers, we will need to add a partitioning concept to Lance so we can have 1 writer per partition and scale horizontally. I guess this will eventually come to Lance anyway, so we can discuss that separately and focus on the single-writer use case here.

Memory

There are some issues with memory management; for example, if we write images and videos, we can easily blow up the MemTable with just a few rows. We will try to solve this in the blob storage discussion and complete the full story of blob datasets and blob files.

Flush speed

Currently a flush runs a batch upsert against the Lance dataset on disk. This might be slow. To flush fast, we should directly flush the MemTable to Lance, and this could be achieved by having features like multi-level fragments with sequential order (something like SSTables in Cassandra or HFiles in HBase). But that would be a much larger scope change. I think if this current proposal works out, then we can explore this direction further.
-
Upsert has always been a pain point for data lakes. I've spent a few years working in this area and I want to share some concerns on this:
Anyway, I'm excited about this upsert idea on Lance, which could make the Lance format a more general data lake solution, and I'm willing to contribute if more details are revealed.
-
Small, fast, frequent upsert (merge-insert + delete) is currently not optimal in Lance, mainly because of 2 issues:
Proposed Solution
Learning from Apache Hudi and Apache Paimon, I think there is a possibility to solve this problem in Lance by introducing some Log-Structured-Merge (LSM) semantics with user-defined primary key, and I would like to discuss this possibility here.
To start simple (or maybe not that simple), we can introduce concepts like primary key, WAL, and MemTable to support LSM-style reader and writer implementations. Here is how this would look:
In short, it turns Lance into a 2-level LSM tree, where the whole on-disk Lance dataset is the second level of the tree, with an additional MemTable and WAL on top of it to enable LSM-style upserts and reads.
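For readers who like to see the shape of the components, here is a tiny structural sketch of that 2-level view; all names are illustrative, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class MemWAL:
    """The extra level on top of the on-disk dataset: MemTable plus WAL location."""
    memtable: object       # in-memory, PK-keyed buffer (see the MemTable sketch)
    wal_root_dir: str      # WAL directory on (object) storage


@dataclass
class LsmLanceTable:
    """2-level LSM view: the whole on-disk Lance dataset is the bottom level."""
    memwal: MemWAL
    dataset: object        # the existing Lance dataset on storage
```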