Stable Row ID #3694
-
CC @wjones127 who authored this proposal (I simply copied it over)
-
After thinking about this problem for some time, I feel there is an alternative we could evaluate and compare tradeoffs against for solving the issue of secondary indexes going out of date when rows move. I call this approach remap separation.

Remap Separation

Problem Analysis

Today, in order to compensate for the fact that secondary indexes become out of date when rows have moved, we remap all the old row addresses to new ones and commit the newly remapped indexes as part of the compaction. This has a few issues:
Solution

Since the root cause of this issue is the remap process during compaction, one alternative is to separate the remap process out from the compaction process, so that compaction only compacts files. There are 2 challenges we need to overcome in this solution:
Old Fragment Reuse Index

To solve these 2 challenges, we will let compaction produce a new index, which we call the Old Fragment Reuse (OFR) index. The idea is that after a compaction process has compacted some old fragments into new fragments, we record:
Note: we don't want to reuse the transaction file because:
Because there might be multiple compactions before a remap happens, we will store one fragment map and row ID map for each version. This results in the following proto definitions:

```proto
message OFRIndexDetails {
  message NewFragmentSignature {
    uint64 fragment_id = 1;
    // if the number of rows has changed, the fragment has logically changed (e.g. new deletes)
    uint64 num_rows = 2;
  }

  message Replacement {
    repeated DataFragment old_fragments = 1;
    repeated NewFragmentSignature new_fragment_signatures = 2;
  }

  message VersionMapEntry {
    uint64 dataset_version = 1;
    repeated Replacement replacements = 2;
    // The path to the row ID mapping file for this version, relative to the index root
    string row_id_map_path = 3;
  }

  message VersionMapEntries {
    repeated VersionMapEntry entries = 1;
  }

  oneof version_map {
    // If small (< 200KB), store inline
    VersionMapEntries entries = 1;
    ExternalFile external_file = 2;
  }
}
```

Change to compaction process

We will introduce a flag in
Remap Index Operation

We will introduce a new index operation in

```rust
/// Remap an index using the OFR index information
async fn remap_index(&mut self, index_id: &Uuid) -> Result<()>;
```

Specifically, it will:
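As a rough, hedged sketch of what applying the OFR row-address map to an index's stored addresses could look like (the types and names below are simplified illustrations, not the actual Lance remap code):

```rust
use std::collections::HashMap;

/// Old row address -> new row address, as loaded from the OFR index's row ID map
/// file for one compaction. `None` marks a row that no longer exists.
type RowAddrMap = HashMap<u64, Option<u64>>;

/// Simplified stand-in for an index's stored row addresses.
struct IndexEntries {
    row_addrs: Vec<u64>,
}

/// Rewrite every stored row address through the OFR map: moved rows get their new
/// address, deleted rows are dropped, untouched rows are kept as-is.
fn remap_entries(entries: &mut IndexEntries, map: &RowAddrMap) {
    let remapped: Vec<u64> = entries
        .row_addrs
        .iter()
        .filter_map(|addr| match map.get(addr) {
            Some(Some(new_addr)) => Some(*new_addr), // moved by compaction
            Some(None) => None,                      // deleted during compaction
            None => Some(*addr),                     // fragment untouched by compaction
        })
        .collect();
    entries.row_addrs = remapped;
}

fn main() {
    // Row at old address 7 moved to 1000, old address 8 was deleted, 9 is untouched.
    let map: RowAddrMap = HashMap::from([(7, Some(1000)), (8, None)]);
    let mut entries = IndexEntries { row_addrs: vec![7, 8, 9] };
    remap_entries(&mut entries, &map);
    assert_eq!(entries.row_addrs, vec![1000, 9]);
}
```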
Remap index as a part of reindexing

Because a remap will conflict with a concurrently running reindexing process, we can do an optimization and run them together. To achieve that, we add an option

Trim OFR Index

The OFR index needs to be trimmed regularly. When all indexes have a

A few considerations:
Using OFR Index at read time

Now comes the important part. We want to leverage the OFR index at read time, so that an existing index can still serve newly compacted fragments even though those fragments are not covered by the index. The idea is the following:
Compared to the stable row ID approach, this seems to be more efficient from both a space and a runtime perspective:
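To make the read-time idea concrete, here is one hedged illustration (hypothetical types and helper, not Lance's actual scanner code) of translating index results expressed in old row addresses before the take step:

```rust
use std::collections::HashMap;

/// Old row address -> new row address for rows that compaction moved.
type RowAddrMap = HashMap<u64, u64>;

/// Translate an index search result (old row addresses) into addresses that are
/// valid in the newly compacted fragments before the take step fetches rows.
/// Rows whose fragments were not compacted keep their original address; rows
/// deleted by compaction are assumed to have been filtered out earlier.
fn translate_results(old_addrs: &[u64], ofr_map: &RowAddrMap) -> Vec<u64> {
    old_addrs
        .iter()
        .map(|addr| *ofr_map.get(addr).unwrap_or(addr))
        .collect()
}

fn main() {
    // Compaction moved the rows at old addresses 42 and 43; address 7 is untouched.
    let ofr_map: RowAddrMap = HashMap::from([(42, 9000), (43, 9001)]);
    assert_eq!(translate_results(&[7, 42, 43], &ofr_map), vec![7, 9000, 9001]);
}
```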
Change to GC process

In this approach, we also need to avoid deleting old fragments until they are removed from the OFR index.

Afterthoughts on stable row ID

I looked into how systems like Postgres and Oracle (traditional database systems that are more on the OLTP side) solve this kind of issue, since they also have indexes that need to be updated when data is vacuumed. It looks like most such database systems use physical row addresses instead of a globally stable row ID for indexes, and do something similar to this asynchronous remapping, with a temporary OFR-like index to serve traffic in the meantime. It is definitely going to make the query planning and execution phase more complex, but it seems like the right direction to go here. I understand systems like Snowflake and Iceberg (and Delta if my memory is right, not super sure though) use a global-counter stable row ID approach, but I think that is more OLAP focused, with use cases like CDC incremental views. So overall I feel remap separation might be a better approach for solving the index coverage problem after compaction, but we should definitely explore whether we could leverage a stable row ID for use cases like CDC incremental views. What do you think? @westonpace @wjones127
-
Yeah, I was thinking about whether we need that, and my current conclusion is that we do. Consider the following setup: the table has fragments

I originally thought about using a bitmap for the check, but I ended up just using the original mapping structure, because we have to check both that the scan covers 5 and 6 and that 5 and 6 have not changed (e.g. by an updated deletion file); only when these 2 conditions are true can we swap to the old fragments 1, 2, 3, 4. For
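As a hedged sketch of that check (simplified stand-in types, not the real fragment reuse index structures), assuming the replacement recorded old fragments 1–4 and signatures for new fragments 5 and 6:

```rust
use std::collections::HashMap;

/// Recorded at compaction time: which new fragments replaced which old ones,
/// and what the new fragments looked like right after compaction.
struct Replacement {
    old_fragment_ids: Vec<u64>,
    /// new fragment id -> num_rows at compaction time
    new_fragment_signatures: HashMap<u64, u64>,
}

/// The scan may swap new fragments 5 and 6 back to old fragments 1..4 (so the
/// existing index can be used) only if (a) the scan covers every new fragment of
/// the replacement and (b) none of them has logically changed since compaction
/// (e.g. a new deletion file, detected here via a changed row count).
fn can_swap_to_old_fragments(
    replacement: &Replacement,
    scanned: &HashMap<u64, u64>, // fragment id -> current num_rows in the scan
) -> bool {
    replacement
        .new_fragment_signatures
        .iter()
        .all(|(frag_id, expected_rows)| scanned.get(frag_id) == Some(expected_rows))
}

fn main() {
    let replacement = Replacement {
        old_fragment_ids: vec![1, 2, 3, 4],
        new_fragment_signatures: HashMap::from([(5, 1000), (6, 1000)]),
    };
    // Scan covers both 5 and 6 and their row counts still match the signatures.
    let scanned = HashMap::from([(5, 1000), (6, 1000)]);
    assert!(can_swap_to_old_fragments(&replacement, &scanned));
    println!("would swap to old fragments {:?}", replacement.old_fragment_ids);
    // Fragment 6 got new deletes (row count changed) -> cannot swap.
    let scanned = HashMap::from([(5, 1000), (6, 990)]);
    assert!(!can_swap_to_old_fragments(&replacement, &scanned));
}
```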
-
Yes. The alternative is to have multiple delta indexes, one per version. I actually haven't thought about the pros and cons of that (because I did not read into the delta indexes part until probably yesterday...). Maybe the details structure can be simplified a bit if we use multiple delta indexes: every compaction produces a separate delta of the OFR index, and when we read we combine them. The downside is that if there are too many old fragments we will use an external file, and with delta indexes there could be multiple external files, increasing IO. The upside is faster compaction completion time, but that process is already async, so having a single index seems a bit more advantageous.
-
Actually, can you explain a bit more about these 2 options? I don't quite get why there are 2 options. When I say "use old fragments", the goal is to leverage the index for random access, because the old fragments are likely indexed. So when you say "Approach 1 could be slow because it still needs to scan the uncompacted data.", that seems to be the opposite of my expectation; shouldn't it be faster because the uncompacted data is indexed? (If it is not indexed, then we will just scan the compacted data.)
-
Just to provide an update on this: the feature I described above has been merged, by introducing the fragment reuse index (#3847) and the auto-remap feature (#3971), so compaction no longer conflicts with index updates and has minimal impact on read performance. I think it would be great if we can now discuss how we should push the original stable row ID effort forward. One great use case I can think of is to use it to track row lineage. This would be similar to the Iceberg row lineage feature (https://iceberg.apache.org/spec/#row-lineage), where we can know exactly how a row has changed across table versions, and have a CDC incremental view of the data when we want. This will require move- and update-stable row IDs, so some additional work is needed to make them update-stable.
-
Lance currently has no stable row ids / primary keys. This is a UX challenge: once a row is inserted, it’s not obvious how to get it back, update it, or delete it. It also creates challenges for keeping secondary indices up-to-date. If the “row id” the secondary index points to changes, we need to update and re-write the secondary index.
In practice, these two challenges are somewhat separate:
Therefore, to solve this incrementally, we can first solve the secondary index problem. Then later solve the primary key problem.
How Lance works now
Right now, Lance has the concept of the `_rowid`. It is a 64-bit value that is the combination of two 32-bit numbers: the `fragment_id` the row is in and the `offset` of the row in that file. The `_rowid` provides easy constant-time access to any row. But as soon as that row moves, whether due to compaction or updates, this id changes. Because the `_rowid` field represents the physical location of the row, for the rest of this document we will call it the “row address” so we can differentiate it from the stable row id.

Goals
What are the outcomes that will result from these changes? How will we evaluate success for the proposed changes?
Non-Goals
To narrow the scope of what we're working on, outline what this proposal will not accomplish.
Proposed Solution
Describe the solution to the problems outlined above. Include enough detail to allow for productive discussion and comments from readers.
TLDR:
Auto-incrementing u64 id
Row ids will be an auto-incrementing u64. Assignment of row ids will be similar to how fragment ids are assigned: we will keep a high water mark in the manifest. During the commit loop, we will save the used row ids in the fragment metadata (more detail on how this works soon).
The high water mark will be a new field `max_row_id`, which will work similarly to `max_fragment_id`.
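A minimal sketch of that assignment, assuming a stripped-down manifest that holds only the high water mark (illustrative only, not the real Lance manifest struct):

```rust
/// Simplified manifest: only the piece relevant to row id assignment.
struct Manifest {
    /// Highest row id assigned so far; behaves like `max_fragment_id`.
    max_row_id: u64,
}

/// Assign a contiguous range of row ids to `num_rows` newly written rows during
/// the commit loop, bumping the high water mark.
fn assign_row_ids(manifest: &mut Manifest, num_rows: u64) -> std::ops::Range<u64> {
    let start = manifest.max_row_id + 1;
    manifest.max_row_id += num_rows;
    start..(manifest.max_row_id + 1)
}

fn main() {
    // After a first insert of 100 rows, ids 0..=99 are used and the mark is 99.
    let mut manifest = Manifest { max_row_id: 99 };
    // Inserting 50 more rows assigns ids 100..=149 and bumps the mark to 149.
    let assigned = assign_row_ids(&mut manifest, 50);
    assert_eq!(assigned, 100..150);
    assert_eq!(manifest.max_row_id, 149);
}
```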
Row id indices inside fragment metadata

To make random access fast, we will need a secondary index on row id. This will map from row id to row address. We will update this index with every write.
The row id index will have separate code paths from other indices. Each fragment will have a separate piece of the index, and these pieces will be combined in memory. Because the row addresses within a fragment are a simple sequence, we can serialize the index as an array of row ids. In other words, the values of the mapping are always `0..physical_rows`, so there’s no need to serialize that part.

The row id index could be encoded to be very small and written inline in the manifest. For example, a fragment that was just inserted would contain a contiguous sequence of row ids, so it just needs to record the start and end of the range. If the index ever got large enough, it could be written to another file. The encoding scheme is an implementation detail, but ideally it should be encoded efficiently both on-disk and in-memory.
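For intuition, a hedged sketch of what such a per-fragment piece could look like (an illustrative enum, not the actual on-disk encoding):

```rust
/// Per-fragment piece of the row id index. The mapping's values are implicitly
/// `0..physical_rows`, so only the row ids (the keys) need to be stored.
enum RowIdSequence {
    /// Freshly inserted fragment: a contiguous run of ids, stored as just start..end.
    Range(std::ops::Range<u64>),
    /// Fragment whose ids have holes (e.g. after rows moved): stored explicitly.
    Array(Vec<u64>),
}

impl RowIdSequence {
    /// Row offset within the fragment for a given row id, if the fragment holds it.
    fn offset_of(&self, row_id: u64) -> Option<u32> {
        match self {
            RowIdSequence::Range(r) if r.contains(&row_id) => Some((row_id - r.start) as u32),
            RowIdSequence::Range(_) => None,
            RowIdSequence::Array(ids) => ids.iter().position(|id| *id == row_id).map(|p| p as u32),
        }
    }
}

fn main() {
    let fresh = RowIdSequence::Range(0..100);
    let compacted = RowIdSequence::Array(vec![3, 7, 120, 121]);
    assert_eq!(fresh.offset_of(42), Some(42));
    assert_eq!(compacted.offset_of(120), Some(2));
    assert_eq!(compacted.offset_of(5), None);
}
```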
Replace deletion files with tombstones in row id index
To reduce the number of files needed, deletion files will be eliminated; instead, deletions will be marked in the row id index. This eliminates extra IOs, since whenever the index is read the deletion file would also have needed to be read. We can encode the deleted rows as a sentinel value.
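Continuing the sketch above, a deletion could then be recorded directly in the fragment's row id sequence using a sentinel instead of a separate deletion file (the sentinel value below is hypothetical):

```rust
/// Hypothetical sentinel marking a deleted slot in a fragment's row id sequence.
const TOMBSTONE: u64 = u64::MAX;

/// Tombstone the slot holding `row_id`, so readers learn about the deletion from
/// the row id index itself and no separate deletion file has to be read.
fn delete_row(row_ids: &mut [u64], row_id: u64) -> bool {
    match row_ids.iter_mut().find(|id| **id == row_id) {
        Some(slot) => {
            *slot = TOMBSTONE;
            true
        }
        None => false,
    }
}

fn main() {
    let mut row_ids = vec![100, 101, 102];
    assert!(delete_row(&mut row_ids, 101));
    assert_eq!(row_ids, vec![100, TOMBSTONE, 102]);
}
```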
Example operations
Let’s review some operations and how they would write row ids:

- The first insert assigns row ids `0..99`. `max_row_id` is now 99.
- A second insert assigns row ids `100..149`. `max_row_id` is now 149.
- A later operation (for example, compaction after the row with id 120 was deleted or moved) leaves row ids `0..119, 121..149`. Notice how, because we use an incrementing range and keep it sorted by row id, we could combine the ranges `0..99` and `100..119` into a single range.
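A hedged sketch of that coalescing step, using half-open ranges (so the inclusive `0..99` and `100..119` above correspond to `0..100` and `100..120`); the helper is illustrative only:

```rust
use std::ops::Range;

/// Coalesce a sorted list of row id ranges, merging ranges that touch, e.g.
/// [0..100, 100..120, 121..150] becomes [0..120, 121..150].
fn coalesce(ranges: &[Range<u64>]) -> Vec<Range<u64>> {
    let mut out: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        match out.last_mut() {
            Some(last) if last.end == r.start => last.end = r.end, // contiguous: extend
            _ => out.push(r.clone()),
        }
    }
    out
}

fn main() {
    // Half-open ranges: 0..100 covers ids 0..=99, matching the example above.
    let merged = coalesce(&[0..100, 100..120, 121..150]);
    assert_eq!(merged, vec![0..120, 121..150]);
}
```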
Reserving row ids

Some systems, like a write-ahead log, might want to reserve a set of row ids ahead of time. For example, they might reserve 1,000 row ids. When it receives an insert transaction, it assigns one of its reserved row ids to that row. When it receives an update transaction on a row that hasn’t been flushed, it can make that update reference the row id that’s been reserved and assigned. Once it is ready to flush a row, it can do so and immediately write the row ids safely, knowing they are already reserved.
We can do this with the same mechanism as we use for reserving fragment ids. You can commit a transaction requesting a certain number of rows, and then from the transaction you can get the range based on how much `max_row_id` changed.
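A hedged sketch of that reservation (hypothetical names; in reality the bump would happen inside the commit loop, just like fragment id reservation):

```rust
/// Simplified manifest holding only the row id high water mark.
struct Manifest {
    max_row_id: u64,
}

/// Reserve `count` row ids by bumping the high water mark; the reserved range is
/// derived from how much `max_row_id` moved during the commit.
fn reserve_row_ids(manifest: &mut Manifest, count: u64) -> std::ops::Range<u64> {
    let before = manifest.max_row_id;
    manifest.max_row_id += count;
    (before + 1)..(manifest.max_row_id + 1)
}

fn main() {
    let mut manifest = Manifest { max_row_id: 149 };
    // A WAL reserves 1,000 ids up front and hands them to rows before flushing.
    let reserved = reserve_row_ids(&mut manifest, 1000);
    assert_eq!(reserved, 150..1150);
}
```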
Time and Space Complexity of the index

When the manifest is read, readers will combine these per-fragment indices into an in-memory index data structure. The space and time complexity will be sensitive to how many ranges are stored in this row id index. It can be kept very fast if the row ids are always full contiguous ranges. But once there are a lot of holes and jumps, it will degrade in performance. Space complexity will become `O(num_rows)` (up to 16 GB for 1 billion rows, assuming a u64 for key and value and completely random ordering). Time complexity could still be pretty fast if we use [interpolation search](https://en.wikipedia.org/wiki/Interpolation_search) (average complexity `O(log(log(n)))` but worst case `O(n)`). There are also hash tables for `O(n)`.

The upshot of using incrementing ids is that sets of new rows can be represented by a single range. In addition, when we compact we can sort by id, potentially merging ranges. This will happen naturally in an LSM tree for an append-only table where we keep the fragments clustered by row id.
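For intuition on the lookup cost, here is a hedged sketch of interpolation search over a sorted row id array (the textbook algorithm, not Lance code):

```rust
/// Interpolation search over a sorted slice of row ids. Average O(log log n) on
/// uniformly distributed ids, worst case O(n); returns the index of `target`.
fn interpolation_search(ids: &[u64], target: u64) -> Option<usize> {
    let (mut lo, mut hi) = (0usize, ids.len().checked_sub(1)?);
    while lo <= hi && target >= ids[lo] && target <= ids[hi] {
        if ids[hi] == ids[lo] {
            return (ids[lo] == target).then_some(lo);
        }
        // Estimate the position by assuming ids are roughly evenly spaced.
        let offset = ((target - ids[lo]) as u128 * (hi - lo) as u128
            / (ids[hi] - ids[lo]) as u128) as usize;
        let mid = lo + offset;
        match ids[mid].cmp(&target) {
            std::cmp::Ordering::Equal => return Some(mid),
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid.checked_sub(1)?,
        }
    }
    None
}

fn main() {
    // Row ids with a gap: 0..=99 followed by 200..=249.
    let ids: Vec<u64> = (0..100).chain(200..250).collect();
    assert_eq!(interpolation_search(&ids, 42), Some(42));
    assert_eq!(interpolation_search(&ids, 210), Some(110));
    assert_eq!(interpolation_search(&ids, 150), None);
}
```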
When updates happen, the rowids will be moved to new fragments, creating holes in the original fragments. This will increase the size of the rowid index. We can have merge strategies prefer to merge together ranges of rowids, so that those holes will later be filled in upon compaction.
The most pathological case is deletion. Deleting rows will create holes in the row ids, which can never be filled in. (If we are making a public row id, we shouldn’t re-use ids.)
ℹ️ **Other clustering schemes.** We’ve contemplated having other clustering schemes, including HNSW order (for fast vector search take) or clustering based on filter columns (for fast filtering). These other clustering schemes will degrade the performance of the row id index by making it much larger. We could try to make an alternative row id assignment scheme that aligns well with a clustering. In order for it to work, it would need a mechanism to prevent duplicate ids and allow for reserving ids ahead of time (or some other affordance for a WAL).

Backwards compatibility
Initially, stable row ids will be an opt-in feature. This will let us experiment with and test this on `main`. A new feature flag will be added to the format, which will prevent older readers and writers from trying to use the dataset.

If we wanted to, we could make a method to transition an existing table. It would use the existing row addresses as the row ids for all rows currently in use.
Indices
Can we re-use the current row address logic for row ids?
Conclusion: as long as the logic doesn’t rely upon the decomposition of the row address, we should be able to re-use the logic. We can search for `>> 32` and `<< 32` to help find these spots.

Fragment coverage bitmaps
Right now, we make certain assumptions about how indices cover fragments:
This second one could now be violated if fragments are compacted:
This should be fine for now, but it will get more complex when we tackle primary keys.
Pre-filtering
Previously, when we saw tombstoned rows or missing fragments, we could easily compute the row ids that should be considered missing. With move-stable row ids, it’s safe to say the tombstoned rows are deleted (though this will change with primary keys / fully-stable row ids). But with missing fragments, it’s ambiguous whether the row ids were deleted or moved. It’s also unclear which row ids should be considered affected, since we can’t just mask the fragment id part of the row id anymore.
The pre-filter will need to be replaced by the row id index combined with the tombstone bitmap. To check whether a row should be filtered, we check that it exists in the index but not in the tombstone bitmap. We could try translating this into a pre-filter to collapse masks if we need to. We could also optimize conversion of row id sequences into the pre-filter.
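A hedged sketch of that check, with std collections standing in for the real row id index and tombstone bitmap:

```rust
use std::collections::{HashMap, HashSet};

/// A row id passes the pre-filter only if it is present in the row id index and
/// has not been tombstoned.
fn is_live(row_id: u64, index: &HashMap<u64, u64>, tombstones: &HashSet<u64>) -> bool {
    index.contains_key(&row_id) && !tombstones.contains(&row_id)
}

fn main() {
    // row id -> row address
    let index = HashMap::from([(0, 0), (1, 1), (2, 2)]);
    let tombstones = HashSet::from([1]);
    assert!(is_live(0, &index, &tombstones));  // indexed and not deleted
    assert!(!is_live(1, &index, &tombstones)); // tombstoned
    assert!(!is_live(7, &index, &tombstones)); // unknown row id
}
```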
Risks
Highlight risks so your reviewers can direct their attention here.