Branching in Lance #3861

jackye1995 · 2025-05-22T15:48:55Z

jackye1995
May 22, 2025
Maintainer

Copying the discussion from Discord

majin1102 — 5:42 AM
Hi community, branching is a good feature of Apache Iceberg. In ByteDance's machine learning platform, Iceberg branches are used for algorithm experimentation and data delivery. Although Lance already supports tags, we still see scenarios requiring branches. Additionally, I found that the Iceberg branch feature was contributed by Jack Ye (https://docs.google.com/document/d/1PvxK_0ebEoX3s7nS6-LOJJZdBYr_olTWH9oepNUfJ-A/edit?tab=t.0). Does Lance Format plan to introduce branches in the future? @jackye1995 @westonpace

Jack Ye — 8:44 AM
Branching is definitely worth discussing, and ML experimentation was originally an important reason for doing branching. For Iceberg it is a bit easier since each new table version is a random UUID, and Iceberg just delegates the responsibility of deciding which version is latest to the catalog. For Lance, the version is strictly increasing and tied to the file name. That removes the complexity of an extra catalog layer for version resolution, but as a result it also enforces a linear history of the table. If we want to do branching in Lance, we probably need to develop some additional semantics to store a branch as a separated linear history with its own versioning.

Another alternative approach to consider (just brainstorming): create a "ref table" concept in Lance instead of doing branch, so (1) a ref Lance table can be created, whose manifest can be a pointer to another source table and evolve from there, (2) the source table is also updated to know there is a ref table pointing at that specific version, so it should not clean up the data. This might be more friendly to the Lance style linear table version. It also solves some complaints with branch that (1) branching makes the internal table version structure too complicated and hard to manage, (2) it is hard for tools and engines to adapt to an additional layer of branch within a table; compared to that, when the branch is exposed as another table, the table approach gets native tooling support out of the box.
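
To make the two steps of the "ref table" idea concrete, here is a minimal in-memory sketch. It is not Lance code: `SourceTable`, `RefTable`, `referenced_by`, and `create_ref_table` are hypothetical names, and the cleanup policy (keep the latest N versions) is an assumption chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SourceTable:
    """Hypothetical stand-in for a table with a linear version history."""
    uri: str
    versions: set[int] = field(default_factory=set)
    # version -> URIs of ref tables pointing at it (assumed bookkeeping)
    referenced_by: dict[int, set[str]] = field(default_factory=dict)

    def cleanup(self, keep_latest: int = 1) -> list[int]:
        """Remove old versions, skipping any version a ref table points at."""
        ordered = sorted(self.versions)
        candidates = ordered[:-keep_latest] if keep_latest else ordered
        removed = [v for v in candidates if not self.referenced_by.get(v)]
        self.versions -= set(removed)
        return removed


@dataclass
class RefTable:
    """A table whose first manifest is only a pointer to (source, version)."""
    uri: str
    source_uri: str
    source_version: int


def create_ref_table(source: SourceTable, version: int, uri: str) -> RefTable:
    if version not in source.versions:
        raise ValueError(f"version {version} does not exist in {source.uri}")
    # Step (2): the source learns that a ref table depends on this version.
    source.referenced_by.setdefault(version, set()).add(uri)
    # Step (1): the ref table starts from a pointer and evolves from there.
    return RefTable(uri=uri, source_uri=source.uri, source_version=version)


source = SourceTable("s3://bucket/main.lance", versions={1, 2, 3, 4})
ref = create_ref_table(source, 2, "s3://bucket/experiment.lance")
removed = source.cleanup(keep_latest=1)
print(removed)  # version 2 is pinned by the ref table, version 4 is the latest
```

The sketch keeps the dependency on the source side only, which is what lets cleanup stay a purely local decision.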

westonpace · 2025-05-23T13:01:04Z

westonpace
May 23, 2025
Maintainer

If we want to do branching in Lance, we probably need to develop some additional semantics to store a branch as a separated linear history with its own versioning.

I think we'd need to probably use the branch name in the manifest name. So, to commit to the latest main today you commit to X.manifest. If we want to commit to the latest mybranch we could write the manifest to X.mybranch.manifest. Manifests could store their parent manifest for version tracking.

History tracking could be problematic if a version is deleted through cleanup. So if a manifest is the base for another branch then we can't delete that version.
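
A rough sketch of the naming idea above, assuming manifests are literally named `X.manifest` and `X.mybranch.manifest` as written in this comment (the actual on-disk manifest naming in Lance may differ). The helper names and the `parents` mapping are hypothetical.

```python
from __future__ import annotations

import re

# Assumed naming: "<version>.manifest" for main, "<version>.<branch>.manifest" otherwise.
MANIFEST_RE = re.compile(
    r"^(?P<version>\d+)(?:\.(?P<branch>[A-Za-z0-9_-]+))?\.manifest$"
)


def parse_manifest_name(name: str) -> tuple[str, int]:
    """Return (branch, version) for a manifest file name."""
    m = MANIFEST_RE.match(name)
    if m is None:
        raise ValueError(f"not a manifest file name: {name!r}")
    return (m.group("branch") or "main", int(m.group("version")))


def latest_per_branch(names: list[str]) -> dict[str, int]:
    """Each branch keeps its own linear history, so 'latest' is per branch."""
    latest: dict[str, int] = {}
    for name in names:
        branch, version = parse_manifest_name(name)
        latest[branch] = max(version, latest.get(branch, 0))
    return latest


def protected_versions(parents: dict[str, tuple[str, int]]) -> set[tuple[str, int]]:
    """Versions that are the base of some branch and must survive cleanup.

    `parents` maps a branch name to the (branch, version) it was forked from.
    """
    return set(parents.values())


files = ["1.manifest", "2.manifest", "3.manifest",
         "3.mybranch.manifest", "4.mybranch.manifest"]
latest = latest_per_branch(files)
protected = protected_versions({"mybranch": ("main", 2)})
print(latest, protected)
```

Under these assumptions, cleanup on main would have to skip `("main", 2)` for as long as `mybranch` exists.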

Another alternative approach to consider (just brainstorming): create a "ref table" concept in Lance instead of doing branch, so (1) a ref Lance table can be created, whose manifest can be a pointer to another source table and evolve from there, (2) the source table is also updated to know there is a ref table pointing at that specific version, so it should not clean up the data.

This is an interesting idea. I think operations could get complicated in this version too since we'd need to access potentially multiple data folders for operations.

jackye1995 · 2025-05-23T16:23:57Z

jackye1995
May 23, 2025
Maintainer Author

This is an interesting idea. I think operations could get complicated in this version too since we'd need to access potentially multiple data folders for operations.

Yeah it does not change too much about history getting complicated, just lifting it to be across multiple tables instead of being within a table, which has its own complications.

The biggest reason I was thinking about this is that many people wanted to use branching in Iceberg for streaming data pipelines, e.g. I write temporarily to a streaming branch. Some users can query the streaming branch but that has small files and suboptimal perf. The main branch keeps having the compacted data from the streaming branch, but data is a bit delayed.

But that kind of setup has a compatibility problem, mainly in platforms for no-code users that don't really allow full SQL or code input. Most platforms allow users to input the table name, whereas this requires table name + branch name. That basically means you can't access your branch in that platform unless the platform specifically supports the branch concept natively. In comparison, if the branch is another table, then basically all platforms support it automatically.

I think Delta actually uses an approach similar to the ref table, it's called shallow clone. But I don't know if it is advanced enough to stop GC in the source table, at least it is not in the OSS version, I have not tried the enterprise version to verify.

majin1102 · 2025-05-27T18:00:24Z

majin1102
May 27, 2025

Another alternative approach to consider (just brainstorming): create a "ref table" concept in Lance instead of doing branch, so (1) a ref Lance table can be created, whose manifest can be a pointer to another source table and evolve from there, (2) the source table is also updated to know there is a ref table pointing at that specific version, so it should not clean up the data. This might be more friendly to the Lance style linear table version. It also solves some complaints with branch that (1) branching makes the internal table version structure too complicated and hard to manage, (2) it is hard for tools and engines to adapt to an additional layer of branch within a table; compared to that, when the branch is exposed as another table, the table approach gets native tooling support out of the box.

I tried to explore the spec details of this approach mentioned by Jack. Please let me know if anything is missing.

The new format layout with branches may look like:

/path/to/dataset:
    data/*.lance  -- Data directory
    ref.json   -- Reference table keeps all branch references and retention policies
    branch1:  -- Branch directories
        _versions/*.manifest
        _transactions/*.txn
        _indices/{UUID-*}/index.idx 
    _versions/*.manifest -- Manifest file for each dataset version.
    _indices/{UUID-*}/index.idx -- Secondary index, each index per directory.
    _deletions/*.{arrow,bin} -- Deletion files, which contain ids of rows that have been deleted.
    _transactions/*.txn

Layout added:

A JSON file that contains the branch reference table with retention policies for each branch (like the refs object in the Iceberg metadata file)
Directories for every single branch. Each directory is related to a single branch and could be treated as an independent dataset (root/branch_name could be used as a dataset URL). Each branch has its own linear history and the version-producing procedure is unchanged (keeping it simple and compatible)

This layout could reach the goals:

The ref.json keeps the lineage of all branches across all versions so that cleanup procedures can act accordingly, reaching the goal:

The source table is also updated to know there is a ref table pointing at that specific version, so it should not clean up the data. This might be more friendly to the Lance style linear table version.

Branches are achieved by an independent file, refs.json (or the branch path, depending on how you use it), which could be totally decoupled from manifest operations. We only need to manipulate refs.json when we do branch operations through APIs. This approach stays 100% compatible with the old format without branches and avoids the situation:

branching makes the internal table version structure too complicated and hard to manage

Every branch could be treated as a single dataset by its absolute path, which brings native tooling support out of the box. This solves the issue:

it is hard for tools and engines to adapt to an additional layer of branch within a table; compared to that, when the branch is exposed as another table, the table approach gets native tooling support out of the box

The final benefits:
-- Branches could share data files under root/data
-- Branches are managed by a well-defined dataset API and are easy to govern
-- When we need to merge data across branches, it is equivalent to operations across datasets
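
As a sketch of how the branch reference file could drive cleanup, the snippet below invents a small JSON schema for it. Every field name (`branches`, `parent`, `parent_version`, `max_age_days`) is an assumption for illustration; nothing here is part of the Lance format.

```python
from __future__ import annotations

import json

# Hypothetical content of the branch reference file; field names are illustrative.
REF_JSON = """
{
  "branches": {
    "branch1": {"parent": "main", "parent_version": 5, "max_age_days": 30},
    "branch2": {"parent": "branch1", "parent_version": 2, "max_age_days": 7}
  }
}
"""


def pinned_versions(ref_doc: dict) -> dict[str, set[int]]:
    """Map each parent (main or a branch) to the versions its children forked from."""
    pinned: dict[str, set[int]] = {}
    for info in ref_doc["branches"].values():
        pinned.setdefault(info["parent"], set()).add(info["parent_version"])
    return pinned


def branch_uri(root: str, branch: str) -> str:
    """Each branch directory doubles as a dataset URI (root/branch_name)."""
    base = root.rstrip("/")
    return base if branch == "main" else f"{base}/{branch}"


refs = json.loads(REF_JSON)
pinned = pinned_versions(refs)
uri = branch_uri("/path/to/dataset", "branch1")
print(pinned, uri)
```

A cleanup job for a given branch would then skip any version listed under that branch's name in `pinned`, without touching manifest logic.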

However, there are several issues with this approach:

Data files are located by the path field in manifests. However, a deletion file's path is derived from its file name ({fragment_id}-{read_version}-{id}.{extension}), which means deletion files cannot be shared across branches because data fragment ids could overlap across branches. Honestly, I don't quite understand the benefits of this deletion file naming, and I would be very grateful if someone could guide me on this. Anyway, I think putting a path in the deletion file structure could solve this issue and stay compatible.
Indexes are quite heavy if we put them under different branches (could index files be shared after rebuilding?). I don't know if indexes could be treated as versioned; the usual phrase is 'rebuild the index'. However, when we need to search on a branch, creating an index on the branch may still be necessary.
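
The deletion-file point can be sketched as follows. The record below models only the naming pattern quoted above; the optional `path` field is the hypothetical addition being proposed, and the class itself is not the real Lance structure.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeletionFile:
    fragment_id: int
    read_version: int
    file_id: int
    extension: str = "arrow"
    path: Optional[str] = None  # hypothetical new field, absent in old records

    def resolve(self, dataset_root: str) -> str:
        """Use the explicit path if present, otherwise derive it from the name."""
        if self.path is not None:
            return self.path
        name = f"{self.fragment_id}-{self.read_version}-{self.file_id}.{self.extension}"
        return f"{dataset_root}/_deletions/{name}"


# Old-style record: location is implied by the dataset root and the name.
legacy = DeletionFile(fragment_id=3, read_version=7, file_id=1)
# Branch record that reuses the parent's deletion file through an explicit path.
shared = DeletionFile(3, 7, 1, path="/path/to/dataset/_deletions/3-7-1.arrow")

legacy_path = legacy.resolve("/path/to/dataset")
shared_path = shared.resolve("/path/to/dataset/branch1")
print(legacy_path == shared_path)
```

Because old records simply lack the field, the derived-name behavior is preserved, which is the compatibility property mentioned above.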

I think we'd need to probably use the branch name in the manifest name. So, to commit to the latest main today you commit to X.manifest. If we want to commit to the latest mybranch we could write the manifest to X.mybranch.manifest. Manifests could store their parent manifest for version tracking.

I haven't given much thought to this approach yet. I'd be very open to trying your solution if we could discuss it further @westonpace

I could provide a detailed document if we align the core approach. @jackye1995 @westonpace

The branch feature is indeed very useful and has proven effective in ByteDance's ML platform (built on Iceberg). I really look forward to contributing to this feature and seeing it integrated into our data products soon.

jackye1995 · 2025-06-18T21:47:23Z

jackye1995
Jun 18, 2025
Maintainer Author

I was chatting with a few Iceberg and Delta committers during the Databricks Summit last week, and I feel increasingly strongly that a shallow table clone seems to be the easiest way to offer a branching experience. It does not have the complexity of internal branching of Iceberg, and a separated table can enjoy all the features that are built in to tables (e.g. access control, discovery, UI/UX, maintenance). And this can actually be achieved pretty easily.

Suppose you have table A, and you want to create a temp branch to do experiments. You can basically create a table B with the manifest of A at a specific version (let's say version X).

On A's side, all we really need to do is to add a tag at version X, so that its data will not be deleted. It doesn't really need to have a knowledge of inter-table dependency. We could make the tag name like branch-B for users to understand the purpose of the tag. Once B is dropped, we can just drop A's tag branch-B. Ideally in theory this should be done atomically, but doing this in 2 separated transactions seems to be sufficient.
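
The two-step tag lifecycle described above could look roughly like this. It is an in-memory model only: `Table`, `shallow_clone`, and `drop_clone` are hypothetical names, and the ordering of the two non-atomic steps is an assumption (pin first on create, unpin last on drop) chosen so that a failure in between leaves a stale tag rather than a dangling clone.

```python
from __future__ import annotations

from typing import Optional


class Table:
    """Hypothetical in-memory stand-in for a table with tags."""

    def __init__(self, name: str, latest_version: int) -> None:
        self.name = name
        self.latest_version = latest_version
        self.tags: dict[str, int] = {}  # tag name -> pinned version
        self.base: Optional[tuple[str, int]] = None  # (source name, version) for clones


def shallow_clone(source: Table, version: int, clone_name: str) -> Table:
    if not 1 <= version <= source.latest_version:
        raise ValueError(f"{source.name} has no version {version}")
    # Transaction 1: pin version X on the source so its data is not cleaned up.
    source.tags[f"branch-{clone_name}"] = version
    # Transaction 2: create the clone starting from the manifest of version X.
    clone = Table(clone_name, latest_version=1)
    clone.base = (source.name, version)
    return clone


def drop_clone(source: Table, clone: Table) -> None:
    # Transaction 1: drop the clone (modelled here by clearing its base).
    clone.base = None
    # Transaction 2: release the pin on the source.
    source.tags.pop(f"branch-{clone.name}", None)


a = Table("A", latest_version=10)
b = shallow_clone(a, 7, "B")
tags_while_cloned = dict(a.tags)
drop_clone(a, b)
print(tags_while_cloned, a.tags)
```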

On B's side, I think there are 2 gaps to fill:

Relative vs absolute path

Lance references everything with relative paths, but here we need to copy the version X manifest to B's directory, and also modify X to X' where X' has all the paths as absolute paths pointing back to files in A. This requires a thorough examination of the current implementation to make sure we can accept both absolute and relative paths for reading.
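
A minimal sketch of the X to X' rewrite, under the assumption that a manifest can be reduced to a list of path strings relative to the source table root. The function names and the example URIs are hypothetical, and real manifests carry more than paths.

```python
from __future__ import annotations

from urllib.parse import urlparse


def is_absolute(path: str) -> bool:
    """Treat rooted paths and anything with a URI scheme (s3://, gs://) as absolute."""
    return path.startswith("/") or bool(urlparse(path).scheme)


def to_absolute(source_root: str, path: str) -> str:
    """Resolve a relative path against the source table root; keep absolute ones."""
    if is_absolute(path):
        return path
    return f"{source_root.rstrip('/')}/{path}"


def rewrite_manifest_paths(source_root: str, paths: list[str]) -> list[str]:
    """Build the path list of X' from the path list of X."""
    return [to_absolute(source_root, p) for p in paths]


rewritten = rewrite_manifest_paths(
    "s3://bucket/A.lance/",
    ["data/f1.lance", "s3://other/f2.lance"],
)
print(rewritten)
```

A reader that accepts both forms would apply the same `is_absolute` check before joining a path with its own table root.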

CDC view of a table

If we write to this experimental table B, we will likely want to write the data back to A if things look good. The user could either rerun the pipeline, or it would be nice for the user to have a CDC view of what has changed in B from version X to the latest version in B, and then apply the same change to A.
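
One very simplified reading of "what has changed in B from version X to the latest version" is a set difference over data files. The sketch below assumes a version can be summarized as a set of file paths and ignores row-level changes, schema changes, and conflicts when A has moved past X; both function names are hypothetical.

```python
from __future__ import annotations


def change_set(base_files: set[str], latest_files: set[str]) -> dict[str, set[str]]:
    """Files added and removed in B between version X and B's latest version."""
    return {
        "added": latest_files - base_files,
        "removed": base_files - latest_files,
    }


def apply_changes(source_files: set[str], changes: dict[str, set[str]]) -> set[str]:
    """Replay B's changes on top of A's current file set (no conflict handling)."""
    return (source_files - changes["removed"]) | changes["added"]


at_x = {"f1.lance", "f2.lance"}          # B at version X (same as A at X)
b_latest = {"f2.lance", "f3.lance"}      # B after the experiment
a_current = {"f1.lance", "f2.lance", "f4.lance"}  # A has since appended f4

changes = change_set(at_x, b_latest)
merged = apply_changes(a_current, changes)
print(sorted(merged))
```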

@westonpace @majin1102 thoughts?

2 replies

majin1102 Jun 19, 2025

I was chatting with a few Iceberg and Delta committers during the Databricks Summit last week, and I feel increasingly strongly that a shallow table clone seems to be the easiest way to offer a branching experience. It does not have the complexity of internal branching of Iceberg, and a separated table can enjoy all the features that are built in to tables (e.g. access control, discovery, UI/UX, maintenance). And this can actually be achieved pretty easily.

The idea of shallow clone sounds pretty good to me.

This approach seems to share only the initial data, and branches natively have separate timelines. I think there's one point that requires extra attention: In the previously discussed branch-based approach, the concept of branches is nested under a dataset. When we drop a dataset, we expect all its branches will be dropped. However, under the shallow clone approach, branches and the main dataset become separated entities. Therefore, when the main dataset is dropped, we may need to provide necessary options or warn users that associated branches will be affected.
I think CDC may be a further step beyond branches. Most cases I know of are appending columns for experiments, so that a merge operation could be used? If any other scenarios should be noted, please let me know.

jackye1995 Jun 19, 2025
Maintainer Author

When we drop a dataset, we expect all its branches will be dropped. However, under the shallow clone approach, branches and the main dataset become separated entities.

Good point. Technically if the source table is dropped, then the cloned table does not have any data so it's just a hanging manifest file that is not useable.

But I think it might make sense to add a feature to say that if there are tags in this table, we have an option to fail the drop. Even outside the context of branching, when people add a tag, it basically means "this version is important and keep it", which is contrary to the drop table operation that drops this version.

Most cases I know of are appending columns for experiments, so that a merge operation could be used?

Yes. Maybe CDC view is not the right term then, this is basically what I mean, just "merging back whatever change happened in the experiment table back to the source table".

bryanck · 2025-07-01T16:48:24Z

bryanck
Jul 1, 2025

We implemented shallow clone in Iceberg, before branching was available, and it resulted in a lot of complexity around file cleanup, i.e. you need to keep track of all tables that reference shared files.
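
The bookkeeping burden mentioned here can be illustrated with a small reference registry. This is a generic sketch of tracking which tables reference which shared files, not a description of how the Iceberg shallow clone implementation worked; the class and method names are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


class SharedFileRegistry:
    """Tracks which tables reference each shared data file."""

    def __init__(self) -> None:
        self._owners: dict[str, set[str]] = defaultdict(set)

    def add_reference(self, table: str, path: str) -> None:
        self._owners[path].add(table)

    def drop_table(self, table: str) -> list[str]:
        """Remove a table's references; return files that no table uses anymore."""
        orphaned = []
        for path in list(self._owners):
            self._owners[path].discard(table)
            if not self._owners[path]:
                orphaned.append(path)
                del self._owners[path]
        return sorted(orphaned)


registry = SharedFileRegistry()
for table, path in [("A", "f1"), ("A", "f2"), ("B", "f1"), ("B", "f3")]:
    registry.add_reference(table, path)

after_a = registry.drop_table("A")  # f1 is still referenced by clone B
after_b = registry.drop_table("B")
print(after_a, after_b)
```

Even in this toy form, every clone, drop, and cleanup has to go through the registry, which hints at where the complexity comes from.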

5 replies

jackye1995 Jul 1, 2025
Maintainer Author

it resulted in a lot of complexity around file cleanup

Curious to know how you are using branching and tagging today. Do you utilize its lifecycle policy, or do you mainly use it to just keep a tag/branch until you remove it?

bryanck Jul 1, 2025

We use it both ways, with the lifecycle policy and also keep it until we remove it.

jackye1995 Jul 1, 2025
Maintainer Author

It almost feels like it's a hard tradeoff. Take access control for example: one common concern I received when trying to promote branching is that the user wants to assign a separate permission to the experimental branch but cannot, since it is within a table and inherits the table permission, so they continue to fall back to just creating a shallow clone. On the other side, this is desirable for features like write-audit-publish, where you do not want to create a temp table that is exposed and then delete it after it is merged back.

Maybe the answer is just that we would eventually have both for different use cases I guess.

bryanck Jul 1, 2025

Security was another area where shallow clone added complexity for us, e.g. there are multiple s3 prefixes and table ACLs to consider.

jackye1995 Jul 1, 2025
Maintainer Author

If we want to do data-level governance for a specific branch, then either way is tricky. Is the current model for you basically that a user who has access to the branch must have access to the table, or is it more intricate?

jackye1995 · 2025-07-02T21:42:30Z

jackye1995
Jul 2, 2025
Maintainer Author

Trying to have a side-by-side comparison:

| | Branching | Shallow Clone |
|---|---|---|
| Creating an experimental branch/table | Create a branch in the table | Create a shallow clone that becomes a new table. Create a tag in the source table to prevent data from being deleted in the source table. |
| Making changes in the experimental branch/table | Write data to the specific branch, with some sort of integration from the calling engine to support writing to the branch, e.g. the table name needs to be parsed to understand the branch name like table.branch_xxx, or the branch is set in the environment through some job or cluster level config. | Write to the shallow clone table |
| Read the experimental branch/table | Read data, likely treat branch as a time travel reference (e.g. SELECT * FROM table VERSION AS OF 'xxx'), or support some convention in table name like table.branch_xxx. | Read the shallow clone table |
| Merge data back to the source table | Read data and merge into source, or have some clever way to apply the same transaction against the source table. Then drop the branch within the table. | Read data and merge into source, or have some clever way to apply the same transaction against the source table. Then drop the shallow clone table. |
| Configure catalog access to the experimental table/branch | Configure access to the branch. This is usually not supported, so it's sharing the same catalog access with the source table. | Configure access to the table |
| Configure data access to the experimental table/branch | Not possible, shared data access with the source table | Possible for new data since it is at a new location, but still need access to the original version of source table data. So technically it is still not possible. |
| User visibility | User can never see this experimental branch outside the table. The visibility is tied to the source table access permission. | If having permission on this experiment table, user can see it outside the table. |

The point I have been trying to make is that whatever the operation in branching, it needs special integration from the calling engine. It is for sure possible to achieve, but it is also where I see the main adoption blocker. For Iceberg, it was basically Spark supporting everything while other engines/SDKs supported only a small set of operations. I hope we don't fall into the same situation here.
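
To illustrate the kind of per-engine integration in question, here is a small sketch of the `table.branch_xxx` naming convention from the comparison above. The `branch_` prefix handling and the function name are assumptions; each engine would need its own equivalent of this parsing step.

```python
from __future__ import annotations

from typing import Optional


def split_branch(identifier: str, prefix: str = "branch_") -> tuple[str, Optional[str]]:
    """Split 'table.branch_xxx' into (table, 'xxx'); plain names have no branch."""
    table, sep, last = identifier.rpartition(".")
    if sep and last.startswith(prefix) and len(last) > len(prefix):
        return table, last[len(prefix):]
    return identifier, None


with_branch = split_branch("db.events.branch_exp1")
without_branch = split_branch("db.events")
print(with_branch, without_branch)
```

Note the ambiguity this convention introduces: a real table that happens to be named `branch_exp1` inside namespace `db.events` cannot be told apart from a branch, which is one reason the convention needs engine-level agreement.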

jackye1995 · 2025-07-02T22:12:24Z

jackye1995
Jul 2, 2025
Maintainer Author

With that being said, I think I might have found a way to combine these 2 approaches together, by allowing a table to be created within another table (still very early thinking):

Basically, we can use shallow clone as the core logic, and branching is another layer of better user experience on top. A shallow clone can clone a table to a new location, or to a location within a table. For the latter, a .lanceignore file is created to prevent the cleanup process from accidentally deleting the data in the nested table directory. For both cases, a tag will be created for the version being cloned to prevent it from being deleted.

For a branch unique experience, we further make sure that:

the cloned table is not registered in the catalog directly
dedicated semantics for using and managing the branch, this requires per-engine integration.
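
A sketch of how a cleanup pass might honor a `.lanceignore` file for nested tables. The file format is not specified in this thread, so the sketch assumes one relative directory per line; `load_ignored` and `cleanup_candidates` are hypothetical helpers, and the demo runs against a temporary directory.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def load_ignored(root: Path) -> set[Path]:
    """Read .lanceignore (assumed format: one relative directory per line)."""
    ignore_file = root / ".lanceignore"
    if not ignore_file.exists():
        return set()
    lines = ignore_file.read_text().splitlines()
    return {root / line.strip() for line in lines if line.strip()}


def cleanup_candidates(root: Path) -> list[str]:
    """List files the outer table's cleanup may inspect, skipping nested tables."""
    ignored = load_ignored(root)
    candidates = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.name == ".lanceignore":
            continue
        if any(d in path.parents for d in ignored):
            continue  # belongs to a nested (branch) table
        candidates.append(path.relative_to(root).as_posix())
    return candidates


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "old.lance").write_text("x")
    (root / "branch1" / "data").mkdir(parents=True)
    (root / "branch1" / "data" / "new.lance").write_text("y")
    (root / ".lanceignore").write_text("branch1\n")
    candidates = cleanup_candidates(root)

print(candidates)
```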

majin1102 · 2025-07-03T06:27:47Z

majin1102
Jul 3, 2025

Thanks to @jackye1995 and @bryanck for the highly detailed input.
I'm currently developing a prototype of shallow_clone, aiming to submit it for community review in approximately 1-2 weeks. Additionally, regarding how to implement complete branch functionality based on shallow_clone while addressing file cleanup challenges, I plan to draft a document alongside the prototype for ongoing discussion in about 2-3 weeks (it may incorporate some comments on the prototype).

1 reply

jackye1995 Jul 3, 2025
Maintainer Author

Thank you for helping with this! Looking forward to the work 🚀
