Branching in Lance #3861
Replies: 8 comments 8 replies
-
I think we'd probably need to use the branch name in the manifest name. So, to commit to the latest main today you commit to History tracking could be problematic if a version is deleted through cleanup. So if a manifest is the base for another branch then we can't delete that version.
This is an interesting idea. I think operations could get complicated in this version too, since we'd potentially need to access multiple data folders for a single operation.
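A minimal sketch of the cleanup constraint described above, assuming a hypothetical in-memory `Table` that records which main version each branch was forked from. None of these names come from Lance; they only illustrate that cleanup has to skip any version that serves as a branch base.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # All manifest versions currently present in the main history.
    versions: set[int]
    # Hypothetical bookkeeping: branch name -> main version it was forked from.
    branch_bases: dict[str, int] = field(default_factory=dict)


def cleanup_old_versions(table: Table, keep_latest: int) -> set[int]:
    """Delete all but the newest `keep_latest` versions, except branch bases."""
    ordered = sorted(table.versions)
    # Everything older than the newest `keep_latest` versions is a candidate.
    candidates = set(ordered[:-keep_latest]) if keep_latest > 0 else set(ordered)
    # A version that is the base of a branch must survive cleanup.
    protected = set(table.branch_bases.values())
    deleted = candidates - protected
    table.versions -= deleted
    return deleted


table = Table(versions={1, 2, 3, 4, 5}, branch_bases={"experiment": 2})
deleted = cleanup_old_versions(table, keep_latest=2)
```

Here version 2 stays on disk even though it is older than the retention window, because the hypothetical `experiment` branch still depends on it.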
-
Yeah, it does not change much about history getting complicated. It just lifts the problem to be across multiple tables instead of within a table, which has its own complications.
The biggest reason I was thinking about this is that many people wanted to use branching in Iceberg for streaming data pipelines. For example, I write temporarily to a streaming branch. Some users can query the streaming branch, but it has small files and suboptimal performance. The main branch keeps receiving the compacted data from the streaming branch, but its data is a bit delayed.
That kind of setup has a compatibility problem, mainly on platforms for no-code users that don't really allow full SQL or code input. Most platforms let users input the table name, whereas this requires table name + branch name. That basically means you can't access your branch on such a platform unless the platform specifically supports the branch concept natively. By comparison, if the branch is another table, then basically all platforms support it automatically.
I think Delta actually uses an approach similar to the ref table; it's called shallow clone. But I don't know if it is advanced enough to stop GC in the source table. At least it is not in the OSS version; I have not tried the enterprise version to verify.
-
I tried to explore the spec details of the approach mentioned by Jack. Please let me know if I missed anything. The new format layout with branches may look like:
Layout added:
This layout could reach the goals:
However, there are several issues with this approach:
I haven't given much thought to this approach yet. I'd be very open to trying your solution if we could discuss it further, @westonpace. I could provide a detailed document once we align on the core approach. @jackye1995 @westonpace The branch feature is indeed very useful and has proven effective in ByteDance's ML platform (built on Iceberg). I really look forward to contributing to this feature and seeing it integrated into our data products soon.
-
I was chatting with a few Iceberg and Delta committers during the Databricks Summit last week, and I feel increasingly strongly that a shallow table clone is the easiest way to offer a branching experience. It does not have the complexity of Iceberg's internal branching, and a separate table can enjoy all the features that are built in to tables (e.g. access control, discovery, UI/UX, maintenance). And this can actually be achieved pretty easily.
Suppose you have table A, and you want to create a temp branch to do experiments. You can basically create a table B with the manifest of A at a specific version (let's say version X).
On A's side, all we really need to do is add a tag at version X, so that its data will not be deleted. A doesn't really need to know about the inter-table dependency. We could make the tag name like
On B's side, I think there are 2 gaps to fill:
Relative vs absolute path
Lance references everything with relative paths, but here we need to copy the version X manifest to B's directory, and also modify X to X', where X' has all the paths as absolute paths pointing back to files in A. This requires a thorough examination of the current implementation to make sure we can accept both absolute and relative paths for reading.
CDC view of a table
If we write to this experimental table B, we will likely want to write the data back to A if things look good. The user could either rerun the pipeline, or it would be nice for the user to have a CDC view of what has changed in B from version X to the latest version in B, and then apply the same change to A.
@westonpace @majin1102 thoughts?
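The relative-to-absolute rewrite from X to X' could be sketched roughly as below. This is only an illustration under stated assumptions: the manifest is modeled as a plain dict with a `fragments` list, and data files are assumed to live under a `data/` directory in the table root. Neither is taken from the actual Lance manifest format, and `to_clone_manifest` and `resolve` are hypothetical helper names.

```python
from urllib.parse import urlparse


def is_absolute(path: str) -> bool:
    """Treat URIs with a scheme (s3://, gs://, ...) and rooted paths as absolute."""
    return bool(urlparse(path).scheme) or path.startswith("/")


def to_clone_manifest(manifest: dict, source_root: str) -> dict:
    """Build X': a copy of manifest X whose file paths point back into table A."""
    root = source_root.rstrip("/")
    fragments = []
    for frag in manifest["fragments"]:
        # Assumption: relative paths are relative to "<table root>/data/".
        files = [p if is_absolute(p) else f"{root}/data/{p}" for p in frag["files"]]
        fragments.append({**frag, "files": files})
    # The input manifest is left untouched; only the copy is rewritten.
    return {**manifest, "fragments": fragments}


def resolve(path: str, table_root: str) -> str:
    """Reader-side rule: accept both forms, resolving relative paths locally."""
    return path if is_absolute(path) else f"{table_root.rstrip('/')}/data/{path}"


manifest_x = {
    "version": 7,
    "fragments": [
        {"id": 0, "files": ["f0.lance"]},
        {"id": 1, "files": ["s3://bucket/A/data/f1.lance"]},
    ],
}
manifest_x_prime = to_clone_manifest(manifest_x, "s3://bucket/A/")
```

With a reader that resolves paths this way, files written to B after the clone can stay relative, while files inherited from A keep their absolute paths.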
-
We implemented shallow clone in Iceberg before branching was available, and it resulted in a lot of complexity around file cleanup, i.e. you need to keep track of all tables that reference shared files.
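To make the bookkeeping burden concrete, here is a rough sketch of the kind of tracking this implies, assuming a hypothetical central `SharedFileRegistry`. It is not how Iceberg or Lance implements cleanup; it only shows that a shared file becomes deletable once the last referencing table lets go of it.

```python
from collections import defaultdict


class SharedFileRegistry:
    """Hypothetical registry of which tables still reference each data file."""

    def __init__(self) -> None:
        self._refs: dict[str, set[str]] = defaultdict(set)

    def add_reference(self, table: str, files: list[str]) -> None:
        """Record that `table` (a source or a shallow clone) uses `files`."""
        for f in files:
            self._refs[f].add(table)

    def drop_reference(self, table: str, files: list[str]) -> list[str]:
        """Remove `table`'s references; return files no table uses any more."""
        deletable = []
        for f in files:
            owners = self._refs.get(f)
            if owners is None:
                continue  # file was never registered
            owners.discard(table)
            if not owners:
                del self._refs[f]
                deletable.append(f)
        return deletable


registry = SharedFileRegistry()
registry.add_reference("A", ["f1", "f2"])
registry.add_reference("B", ["f1"])  # B is a shallow clone sharing f1 with A
after_a_cleanup = registry.drop_reference("A", ["f1", "f2"])
after_b_cleanup = registry.drop_reference("B", ["f1"])
```

Cleanup on A alone can only remove `f2`; `f1` has to wait until B stops referencing it, which is why every table's cleanup needs visibility into the others.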
-
Trying to have a side-by-side comparison:
The point I have been trying to make is that any branching operation needs special integration from the calling engine. It is certainly possible to achieve, but it is also where I see the main adoption blocker. For Iceberg, it was basically Spark supporting everything, with other engines/SDKs supporting only a small set of operations. I hope we don't fall into the same situation here.
-
With that being said, I think I might have found a way to combine these 2 approaches, by allowing a table to be created within another table (still very early thinking): Basically, we can use shallow clone as the core logic, and branching is another layer of better user experience on top. A shallow clone can clone a table to a new location, or to a location within a table. For the latter, a For a branch-unique experience, we further make sure that:
-
Thanks to @jackye1995 and @bryanck for the highly detailed input.
-
Copying the discussion from Discord
majin1102 — 5:42 AM
Hi community, branching is a good feature of Apache Iceberg. In ByteDance's machine learning platform, Iceberg's branches are used for algorithm experimentation and data delivery. Although Lance already supports tags, we still see scenarios requiring branches. Additionally, I found that the Iceberg branch feature was contributed by Jack Ye (https://docs.google.com/document/d/1PvxK_0ebEoX3s7nS6-LOJJZdBYr_olTWH9oepNUfJ-A/edit?tab=t.0). Does Lance Format plan to introduce branches in the future? @jackye1995 @westonpace
Jack Ye — 8:44 AM
Branching is definitely worth discussing, and ML experimentation was originally an important reason for doing branching. For Iceberg it is a bit easier, since each new table version is a random UUID and Iceberg just delegates the responsibility of deciding which version is latest to the catalog. For Lance, the version is strictly increasing and tied to the file name. That removes the complexity of an extra catalog layer for version resolution, but as a result it also enforces a linear history of the table. If we want to do branching in Lance, we probably need to develop some additional semantics to store a branch as a separate linear history with its own versioning.
Another alternative approach to consider (just brainstorming): create a "ref table" concept in Lance instead of doing branches, so (1) a ref Lance table can be created, whose manifest can be a pointer to another source table and evolve from there, (2) the source table is also updated to know there is a ref table pointing at that specific version, so it should not clean up the data. This might be more friendly to the Lance-style linear table version. It also addresses some complaints about branching: (1) branching makes the internal table version structure too complicated and hard to manage, (2) it is hard for tools and engines to adapt to an additional layer of branches within a table, whereas when a branch is exposed as another table, the table approach gets native tooling support out of the box.
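For the first idea above, a branch stored as its own linear history, version resolution could look roughly like the sketch below. The `<branch>-<version>.manifest` naming is purely an assumption for illustration and is not Lance's actual manifest naming scheme. The sketch also assumes each branch numbers its versions independently.

```python
import re
from typing import Optional

# Hypothetical naming scheme: "<branch>-<version>.manifest".
_MANIFEST_RE = re.compile(r"^(?P<branch>[A-Za-z0-9_]+)-(?P<version>\d+)\.manifest$")


def latest_version(manifest_names: list[str], branch: str) -> Optional[int]:
    """Return the highest version committed on `branch`, or None if it has none."""
    versions = []
    for name in manifest_names:
        m = _MANIFEST_RE.match(name)
        if m and m.group("branch") == branch:
            versions.append(int(m.group("version")))
    return max(versions, default=None)


def next_manifest_name(manifest_names: list[str], branch: str) -> str:
    """Name of the manifest a writer would try to create for the next commit."""
    latest = latest_version(manifest_names, branch)
    return f"{branch}-{(latest or 0) + 1}.manifest"


names = ["main-1.manifest", "main-2.manifest", "exp-1.manifest", "notes.txt"]
main_latest = latest_version(names, "main")
next_on_exp = next_manifest_name(names, "exp")
```

Each branch keeps the strictly increasing, file-name-based versioning that Lance relies on today, so no catalog is needed to decide which version is latest within a branch.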