From shallow clone to branching #4256

majin1102 · 2025-07-18T13:05:06Z

Motivation

As discussed in #3861

Goals

Support shallow clone: a lightweight copy of a dataset (metadata only) with its own operation timeline, flexible access control, etc. Shallow clone brings complexity around file cleanup.
Support branch: it has the capabilities of shallow clone and also maintains the file lifetime relationship with the original dataset.
Try to make branch reuse the capabilities of shallow clone.

Overall design

Consider the main experiment scenario. There are two basic roles:

Dataset/table maintainer: usually the owner of the data.
Experimenters: there can be multiple users running experiments on the dataset and making changes (mostly merging new columns and comparing).

The normal procedure of shallow cloning in this case would be:

The dataset owner creates a tag (marking a meaningful, shareable snapshot). Experimenters usually have only read-only permissions on the source dataset, which include cloning.
Experimenters use the shallow clone function to get a new dataset in their target directories (or namespaces). They have read and write permissions on the cloned datasets.
When the experimenters are done, they may notify the data owner to merge the stable experimental result. This could be an operation across the source dataset and the cloned dataset.
When the cloned dataset is created, it doesn't copy any data from the source dataset; it only stores references to, or absolute paths of, the source dataset files.

Shallow clone

Shallow clone would introduce two key concepts:

Clone operation: a new variant that needs to be added to Operation.
reference_paths: stores all the referenced paths in this manifest.

There's a prototype for shallow cloning: #4257

reference_paths

Add reference_paths into manifest specification:

    /* reference to source datasets*/
    pub reference_paths: Vec<String>,

When we call dataset.shallow_clone(), the produced dataset manifest should:

clone the source manifest at the specified referent version
add the source dataset path to its reference_paths
use the largest index of reference_paths as the path_base_index in each DataFile and DeletionFile:

pub struct DataFile {
    ......
    /// The base path of the datafile, when the datafile is a reference to another dataset.
    pub path_base_index: Option<u32>,
}
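The three steps above can be sketched roughly as follows. This is only an illustration, not the prototype code: the trimmed-down `Manifest` and `DataFile` types and the `shallow_clone_manifest` function are hypothetical, and keeping an existing `path_base_index` when cloning from an already cloned dataset is my assumption.

```rust
/// Hypothetical, trimmed-down stand-in for the real data file entry.
#[derive(Clone, Debug, PartialEq)]
pub struct DataFile {
    pub path: String,
    /// Index into `Manifest::reference_paths`; `None` means the file lives in this dataset.
    pub path_base_index: Option<u32>,
}

/// Hypothetical, trimmed-down stand-in for the real manifest.
#[derive(Clone, Debug, PartialEq)]
pub struct Manifest {
    pub reference_paths: Vec<String>,
    pub files: Vec<DataFile>,
}

/// Build the manifest of a shallow clone of `source`, which is stored at `source_path`.
pub fn shallow_clone_manifest(source: &Manifest, source_path: &str) -> Manifest {
    // Step 1: clone the source manifest (metadata only, no data is copied).
    let mut cloned = source.clone();
    // Step 2: record the source dataset path.
    cloned.reference_paths.push(source_path.to_string());
    // Step 3: the newly added path has the largest index.
    let new_index = (cloned.reference_paths.len() - 1) as u32;
    for file in &mut cloned.files {
        // Assumption: files that already point at an older reference keep their index;
        // files owned by the source now point at the newly added path.
        if file.path_base_index.is_none() {
            file.path_base_index = Some(new_index);
        }
    }
    cloned
}
```

DeletionFile would be handled the same way; it is left out to keep the sketch short.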

When we need a data file or deletion file path, we check the path_base_index field. If path_base_index is present, we look up the real base path in the manifest's reference_paths and use it to generate the actual file path.
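A minimal sketch of that lookup, assuming paths are plain strings joined with `/`. The `resolve_path` function and its parameters are hypothetical names used for illustration, not the prototype's API.

```rust
/// Resolve the full path of a data file or deletion file.
///
/// Returns `None` if `path_base_index` points outside `reference_paths`,
/// which would mean the manifest is inconsistent.
pub fn resolve_path(
    dataset_root: &str,
    reference_paths: &[String],
    path_base_index: Option<u32>,
    relative_path: &str,
) -> Option<String> {
    let base = match path_base_index {
        // The file belongs to a referenced (source) dataset.
        Some(i) => reference_paths.get(i as usize)?.as_str(),
        // No index: the file lives under the current dataset root.
        None => dataset_root,
    };
    Some(format!("{}/{}", base.trim_end_matches('/'), relative_path))
}
```

Returning an `Option` here is just one choice; a real implementation would probably surface a proper error for an out-of-range index.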

Referring implementation

In the previous discussion we talked about replacing relative paths with absolute paths for referring to files. I think I have figured out a way to avoid introducing absolute paths and the compatibility and migration complexity that comes with them. With this implementation I think we don't need to modify the spec of DataFragment.

Data file referring:
If fragment_id > reference.max_fragment_id, use the cloned dataset relative path. Otherwise, use the source dataset path.
If deletion_file.read_version >= reference.version, use the cloned dataset relative path. Otherwise, use the source dataset path.
Index file referring: I think index files could be referred to with a similar approach. But I'm not sure whether it's necessary to shallow clone index files, since I believe we may not need index files in experiment scenarios. So I didn't implement index file cloning in the prototype.
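The two rules for data files and deletion files can be sketched as below. The `Reference` struct layout, the `Owner` enum, and both function names are hypothetical; only the two comparisons are taken from the rules above.

```rust
/// Hypothetical record of the point at which the clone was taken.
pub struct Reference {
    /// Path of the source dataset.
    pub path: String,
    /// Source dataset version that was cloned.
    pub version: u64,
    /// Largest fragment id that existed in the source at clone time.
    pub max_fragment_id: u64,
}

/// Which dataset a file should be read from.
#[derive(Debug, PartialEq)]
pub enum Owner {
    Source,
    Clone,
}

/// Fragments created after the clone have ids above `max_fragment_id`.
pub fn data_file_owner(fragment_id: u64, reference: &Reference) -> Owner {
    if fragment_id > reference.max_fragment_id {
        Owner::Clone
    } else {
        Owner::Source
    }
}

/// Deletion files read at or after the referenced version belong to the clone.
pub fn deletion_file_owner(read_version: u64, reference: &Reference) -> Owner {
    if read_version >= reference.version {
        Owner::Clone
    } else {
        Owner::Source
    }
}
```

Since the decision depends only on ids and versions already stored in the fragment metadata, no extra field is needed per file.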

Clone operation

I believe shallow_clone should be a new type of Operation, similar to Overwrite: Overwrite writes batch data, while Clone writes reference metadata.

The Clone operation:

    Clone {
        is_shallow: bool, 
        ref_name: String,
        ref_version: u64,
        ref_path: String,
        ref_path_index: u32,
    }

is_shallow: is this a shallow or deep clone operation
ref_name: referent tag name or the branch name in the future
ref_version: referent source dataset version
ref_path: the source dataset absolute path (without scheme and bucket name)
ref_path_index: the path index used in this clone operation (considering we could clone from a cloned dataset)

Note:
is_shallow indicates whether this is a deep clone or a shallow clone. Compared with shallow cloning, I think deep cloning could be useful when we want a clean dataset at some version or tag without worrying about file lifetime relationships. Meanwhile, deep cloning through OSS copying would be faster, more efficient, and cheaper than copying through SDKs such as Python. I didn't implement deep cloning in the prototype.
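To show how the fields fit together, here is a rough sketch of constructing the operation for a shallow clone. The unit-like `Overwrite` variant is only a placeholder for the existing variants, `shallow_clone_op` is a hypothetical helper, and setting `ref_path_index` to the position where the source path lands in reference_paths is my assumption.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Operation {
    /// Placeholder for the existing variants; their real fields are omitted here.
    Overwrite,
    Clone {
        is_shallow: bool,
        ref_name: String,
        ref_version: u64,
        ref_path: String,
        ref_path_index: u32,
    },
}

/// Build the Clone operation for a shallow clone of `ref_name` at `ref_version`.
///
/// `existing_reference_paths` is the number of entries already present in the
/// source manifest's reference_paths (non-zero when cloning from a cloned dataset).
pub fn shallow_clone_op(
    ref_name: &str,
    ref_version: u64,
    source_path: &str,
    existing_reference_paths: usize,
) -> Operation {
    Operation::Clone {
        is_shallow: true,
        ref_name: ref_name.to_string(),
        ref_version,
        ref_path: source_path.to_string(),
        // Assumption: the source path is appended to reference_paths,
        // so its index equals the current length of that list.
        ref_path_index: existing_reference_paths as u32,
    }
}
```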

majin1102 · 2025-07-21T08:08:33Z

Author

Branching (TODO)

As discussed in #3861, branching could be a narrow case of shallow cloning:

The clone path is directly under the source dataset root path, like:

dataset
    _version
    _index
    _transaction
    _branch
        cloned_dataset
    data

is_strong_ref is set to true, and we need to implement file lifetime management for this strong reference. This is similar to the retention policies in Iceberg: https://docs.google.com/document/d/1PvxK_0ebEoX3s7nS6-LOJJZdBYr_olTWH9oepNUfJ-A/edit?tab=t.0
I think for the branching case we don't need to copy the whole manifest, only the Reference struct.
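Under the layout above, telling a branch apart from an ordinary shallow clone could be as simple as a path check. The sketch below assumes string paths joined with `/`; `branch_path` and `branch_name_of` are hypothetical helpers, not an existing API.

```rust
/// Path of a branch that lives under the `_branch` directory of `dataset_root`.
pub fn branch_path(dataset_root: &str, branch_name: &str) -> String {
    format!("{}/_branch/{}", dataset_root.trim_end_matches('/'), branch_name)
}

/// If `candidate` sits directly under `<dataset_root>/_branch/`, return the branch name.
pub fn branch_name_of<'a>(dataset_root: &str, candidate: &'a str) -> Option<&'a str> {
    let prefix = format!("{}/_branch/", dataset_root.trim_end_matches('/'));
    let name = candidate.strip_prefix(prefix.as_str())?;
    // Only a single path segment counts as a branch name.
    if name.is_empty() || name.contains('/') {
        None
    } else {
        Some(name)
    }
}
```

A clone whose path passes this check could then be created with is_strong_ref set to true, while clones anywhere else stay plain shallow clones.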

0 replies

jackye1995 · 2025-07-22T22:33:58Z

Maintainer

The latest discussion is in #4257 (comment)

1 reply

majin1102 Jul 23, 2025
Author

I will update the design doc to follow the discussion in that prototype, with a slight delay.
