Branching (TODO)
As discussed in #3861, branching could be a narrow case of shallow cloning:
Some of the latest discussion: #4257 (comment)
Motivation
As discussed in #3861
Goals
Overall design
For the main experiment scenario, there are two basic roles:
The normal procedure of shallow cloning in this case would be:
Shallow clone
Shallow clone would introduce two key concepts:
There's a prototype for shallow cloning: #4257
reference_paths
Add reference_paths to the manifest specification:
When we call dataset.shallow_clone(), the produced dataset manifest should:
When we need a data file or deletion file path, we check the path_base_index field. If path_base_index is present, we look up the real base path in the manifest's reference_paths and use it to generate the actual file path.
Referring implementation
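The path_base_index lookup described above could be sketched roughly as follows. This is only an illustration, not the prototype's code: the `Manifest` and `DataFile` classes are simplified stand-ins, `resolve_path` is a hypothetical helper, and treating reference_paths as a mapping from index to base URI is an assumption (the proposal does not spell out its exact shape).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DataFile:
    path: str                              # path relative to its base
    path_base_index: Optional[int] = None  # key into reference_paths, if any


@dataclass
class Manifest:
    # Assumed shape: index -> base URI of a referenced (source) dataset.
    reference_paths: Dict[int, str] = field(default_factory=dict)


def resolve_path(dataset_uri: str, manifest: Manifest, data_file: DataFile) -> str:
    """Return the full path of a file, honoring path_base_index when present."""
    if data_file.path_base_index is None:
        # No reference: the file lives under the current dataset.
        base = dataset_uri
    else:
        # Reference present: source the real base path from the manifest.
        base = manifest.reference_paths[data_file.path_base_index]
    return f"{base.rstrip('/')}/{data_file.path.lstrip('/')}"
```

Files without path_base_index keep resolving against the dataset's own location, so manifests written before the field existed would not need to change under this sketch.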
We previously discussed replacing relative paths with absolute paths to refer to files. I think I have found a way to avoid introducing absolute paths, along with the compatibility and migration complexity they bring. With this implementation, I don't think we need to modify the DataFragment spec.
If fragment_id > reference.max_fragment_id, use the cloned dataset's relative path. Otherwise, use the source dataset path.
If deletion_file.read_version >= reference.version, use the cloned dataset's relative path. Otherwise, use the source dataset path.
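As a minimal sketch of the two rules above, assuming the reference metadata carries the source dataset's location together with the version and max_fragment_id it was cloned at: the `Reference` class, its `source_uri` field, and both helper functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    source_uri: str       # hypothetical: base path of the source dataset
    version: int          # source version the clone was taken at
    max_fragment_id: int  # largest fragment id in the source at that version


def fragment_base(clone_uri: str, ref: Reference, fragment_id: int) -> str:
    """Fragments created after the clone belong to the cloned dataset."""
    return clone_uri if fragment_id > ref.max_fragment_id else ref.source_uri


def deletion_file_base(clone_uri: str, ref: Reference, read_version: int) -> str:
    """Deletion files written at or after the cloned version belong to the clone."""
    return clone_uri if read_version >= ref.version else ref.source_uri
```

Because both decisions depend only on values already stored with a fragment or deletion file, nothing extra has to be recorded per file, which is why the DataFragment spec can stay unchanged.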
Clone operation
I believe shallow_clone should be a new type of Operation, similar to Overwrite: Overwrite writes batch data, while Clone writes reference metadata.
The Clone operation:
Note:
is_shallow indicates whether this is a shallow clone or a deep clone. Compared with shallow cloning, I think deep cloning could be useful when we want a clean dataset at some version or tag without worrying about file lifetime relationships. Meanwhile, deep cloning would be faster, more efficient, and cheaper with OSS copying than with SDKs such as Python. I didn't implement deep cloning in the prototype.