
Distributed Training with and without Kubeflow

Prerequisites

VirtualBox
Vagrant

Distributed Training on K8s cluster

  1. Start the vagrant machine:

    cd <project-root-dir>
    vagrant up
    vagrant ssh
  2. Configure kubectl with the K8s cluster of your choice. (The vagrant machine has the gcloud CLI and ibmcloud CLI installed.) For example:
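
    A sketch of pointing kubectl at an existing cluster (cluster names and zones are placeholders, and flag spellings vary slightly across CLI versions):

    # IBM Cloud:
    ibmcloud login
    ibmcloud ks cluster config --cluster <cluster_name>
    
    # Google Cloud:
    gcloud container clusters get-credentials <cluster_name> --zone <zone>
    
    # verify the active context
    kubectl config current-context
    kubectl get nodes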

  3. (Optional) Build and update the training container:

    3.1 Build the Docker image:

    cd /vagrant/dist_training_keras_k8s
    docker build -t dist_training:latest .

    3.2 (Optional) Test the container:

    # create a network
    docker network create \
    --subnet=172.18.0.0/16 \
    --gateway=172.18.0.1 \
    test-net
    
    # run worker0 container
    docker run --rm \
    --name worker0 \
    -p 12345:12345 \
    --network=test-net \
    dist_training:latest
    
    # run worker1 container
    docker run --rm \
    --name worker1 \
    -p 23456:23456 \
    --network=test-net \
    -e "TF_CONFIG={\"cluster\":{\"worker\":[\"172.18.0.1:12345\",\"172.18.0.1:23456\"]},\"task\":{\"type\":\"worker\",\"index\":1}}" \
    dist_training:latest
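
    In the test above, worker0 is expected to pick up a default TF_CONFIG from the image; if yours does not bake one in, pass it explicitly the same way (a sketch mirroring the worker1 command, with task index 0):

    docker run --rm \
    --name worker0 \
    -p 12345:12345 \
    --network=test-net \
    -e "TF_CONFIG={\"cluster\":{\"worker\":[\"172.18.0.1:12345\",\"172.18.0.1:23456\"]},\"task\":{\"type\":\"worker\",\"index\":0}}" \
    dist_training:latest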

    3.3 Push the image to a registry and replace the image location in the K8s yaml files.

    # using Docker Hub as an example:
    # first, create a repository on Docker Hub, then run the following:
    docker logout
    docker login
    docker tag dist_training:latest <user_name>/<repo_name>:latest
    docker push <user_name>/<repo_name>:latest
    
    # replace the image under spec.template.spec.containers in both yaml files.
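    # e.g., assuming each yaml has exactly one "image:" line:
    sed -i "s|image: .*|image: <user_name>/<repo_name>:latest|" worker0.yaml worker1.yaml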
  4. Deploy to K8s cluster:

    kubectl apply -f worker0.yaml
    kubectl apply -f worker1.yaml
    
    # NOTE: insufficient resources can cause the K8s jobs to fail.
    # The jobs will restart automatically on failure.
    # Run this to check the termination message.
    # See https://kubernetes.io/docs/tasks/debug-application-cluster/determine-reason-pod-failure/ for more information.
    kubectl get pods
    kubectl get pod <pod-name> --output=yaml
  5. Get the logs. You should see both workers finish one epoch. NOTE: You can configure the number of epochs in the yaml files, but it must be the same in both.

    kubectl logs <pod-name>

Distributed Training with Kubeflow

  1. Launch a K8s cluster: On IBM Cloud, create a classic K8s cluster with default settings. This is not included in the free tier or the provided class resources, but TFJob requires a cluster set up for RWX (ReadWriteMany, i.e. read-write across multiple nodes) storage. See this page for more information.

  2. Configure the classic cluster following these instructions. The commands are also included as a shell script in this repo.

    cd /vagrant 
    ibmcloud login
    sh setupCluster.sh <cluster_name>
    # NOTE: don't run the script more than once 
  3. Install kfctl: Download kfctl_v1.2.0-0-gbc038f9_linux.tar.gz from https://github.com/kubeflow/kfctl/releases, extract it, and put the kfctl binary under either /vagrant or /home/vagrant.
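
    For example (a sketch, assuming the standard GitHub release URL pattern):

    cd /vagrant
    wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
    tar -xvf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
    ./kfctl version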

  4. Follow these instructions to install Kubeflow on the cluster. The commands are also included as scripts in this repository:

    cd /vagrant
    sh installKF.sh
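
    For reference, installKF.sh presumably wraps the standard kfctl deployment flow from the Kubeflow docs (a sketch; the variable names mirror the official guide, and the kfdef config URI is left as a placeholder):

    export PATH=$PATH:/vagrant       # make the kfctl binary from step 3 visible
    export KF_NAME=<kubeflow_deployment_name>
    export KF_DIR=/home/vagrant/${KF_NAME}
    mkdir -p ${KF_DIR}
    cd ${KF_DIR}
    kfctl apply -V -f <kfdef_config_uri>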
  5. Expose the Kubeflow dashboard following these instructions. The commands are also included in this repository as a shell script.

    cd /vagrant
    sh exposeLB.sh
    
    # You will see the IP for your Kubeflow dashboard in the EXTERNAL-IP column.
    # Log in to the dashboard with the following credentials:
    username=user@example.com
    password=12341234
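
    exposeLB.sh presumably switches the Istio ingress gateway Service to type LoadBalancer; you can inspect the result yourself (a sketch, assuming the default istio-system namespace):

    # the script likely does something equivalent to:
    kubectl patch svc istio-ingressgateway -n istio-system \
      -p '{"spec": {"type": "LoadBalancer"}}'
    
    # the dashboard IP then appears under EXTERNAL-IP:
    kubectl get svc istio-ingressgateway -n istio-system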
  6. Launch a Jupyter notebook server.

  7. Launch a terminal in Jupyter and clone the Kubeflow examples repo.

    git clone https://github.com/kubeflow/examples.git git_kubeflow-examples
  8. Open the notebook mnist/mnist_ibm.ipynb and follow it to train and deploy the TFJob for MNIST on Kubeflow.

    8.1. When running reload(notebook_setup), you may encounter an error with git checkout. Make the following modification to mnist/notebook_setup.py:

    clone_dir = os.path.join(home, "git_tf-operator")
    if not os.path.exists(clone_dir):
        logging.info("Cloning the tf-operator repo")
        subprocess.check_call(["git", "clone",
                               "https://github.com/kubeflow/tf-operator.git",
                               clone_dir])
    logging.info(f"Checkout kubeflow/tf-operator @{TF_OPERATOR_COMMIT}")
    
    # Add cwd=clone_dir so the checkout runs inside the cloned repo.
    # (Note: running "cd" via subprocess has no effect -- cd is a shell
    # builtin, and a child process cannot change this process's directory.)
    subprocess.check_call(["git", "checkout", TF_OPERATOR_COMMIT], cwd=clone_dir)

    8.2 If you run into ImportError: No module named 'msrestazure', a quick workaround is to install it with pip3 install msrestazure

    8.3 You should now be able to create the TFJob successfully. NOTE: the later steps in the notebook were not tested in our project.
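
    To confirm the TFJob was created from outside the notebook (a sketch; the job name "mnist" and the namespace "kubeflow" are assumptions that depend on the notebook's settings):

    kubectl get tfjobs -n kubeflow
    kubectl describe tfjob mnist -n kubeflow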

Acknowledgement

This project is derived from the following works:
