
DagsHub Import-Export tool

This Python package can be used as a CLI to copy almost all aspects of a DagsHub repo, even across different DagsHub installations.

What's being copied:

  • Git
  • Data stored in the repo's DagsHub storage bucket
  • DVC-managed data, provided it is pushed to the DagsHub repo's built-in DVC remote
  • MLflow experiments
  • MLflow models
  • MLflow artifacts
  • DagsHub Data Engine datasources and datasets, including all latest metadata, annotations, etc.
  • Label Studio projects and tasks (this might require answering manual prompts to choose annotation fields)

Not copied yet:

  • Unsaved Label Studio tasks. Before running this tool, click the green Save icon at the top of the Label Studio project UI to save the tasks as metadata back to Data Engine; that metadata does get copied and can be used to re-create the Label Studio project and tasks in the new location.
  • MLflow traces (LLM traces feature)
  • Historical versions of Data Engine (datasource) metadata values. Only the current version of each metadata value is copied. Contact us if copying the full history is a requirement.

Requirements

To run: Python 3.10 or later

Prerequisites:

  • The source and destination repositories exist and are accessible from this computer, and you have permission to write to the destination.
  • (for datasources) The destination repository has (at least) the same connected buckets as the source repository. You need to manually connect the same buckets to the target repo before running this script. If any bucket is missing in the target, the script will print a warning and refuse to start.
  • (for MLflow) There is nothing in the destination repository's MLflow. If any experiments are already logged in the target repo, the script will print a warning and refuse to start. (See the sketch after this list for one way to check this in advance.)
  • Since unsaved Label Studio tasks are not copied, it's strongly recommended that you Save any existing Label Studio projects back to a Data Engine metadata field. Otherwise, the existing human-labeled task data will be left behind.
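If you want to verify the MLflow prerequisite before running the tool, below is a minimal sketch (not part of this package). It assumes the destination repo exposes its MLflow tracking server at the repo URL with a ".mlflow" suffix (as DagsHub does), that your credentials are set via MLflow's standard MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD environment variables, and that the URL below is a placeholder for your destination repo.

import mlflow

# Placeholder destination repo; the ".mlflow" suffix is an assumption about
# how the destination DagsHub deployment exposes its MLflow tracking server.
mlflow.set_tracking_uri(
    "https://another-dagshub-deployment.com/other-user/other-repo-to-export-to.mlflow"
)

client = mlflow.MlflowClient()
experiments = client.search_experiments()

# A fresh MLflow server usually contains only the "Default" experiment.
non_default = [e.name for e in experiments if e.name != "Default"]
if non_default:
    print("Destination MLflow is not empty:", non_default)
else:
    print("Destination MLflow looks empty; safe to import.")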

Installation

With uv (recommended):

uv tool install git+https://github.com/dagshub/dagshub-import-export

With regular pip (recommended to create a virtual environment first):

pip install git+https://github.com/dagshub/dagshub-import-export

Usage

dagshub-import-export https://one-dagshub-deployment.com/user/repo-to-import-from https://another-dagshub-deployment.com/other-user/other-repo-to-export-to

By default, everything will be imported.

You can choose which specific parts to import using the corresponding flags:

  • --git
  • --dvc
  • --bucket
  • --datasources
  • --metadata
  • --mlflow
  • --labelstudio

Turning any of them on means that ONLY the subsystems selected by the flags are imported.
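For example, to copy only the Git history and the MLflow experiments, models, and artifacts (the URLs are the same placeholders as above; passing the flags before the positional arguments is assumed to work, as with most CLIs):

dagshub-import-export --git --mlflow https://one-dagshub-deployment.com/user/repo-to-import-from https://another-dagshub-deployment.com/other-user/other-repo-to-export-to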

This package includes a vendored version of MLflow's mlflow-export-import (licensed under Apache 2.0), with DagsHub-specific fixes.

FAQ

Why is importing MLflow experiments taking a very long time?

Unfortunately, MLflow's export-import currently takes a long time to run. Please be patient and let it run uninterrupted; stopping it in the middle will require restarting the whole process.
