- VirtualBox
- Vagrant

1. Start the vagrant machine:

   ```shell
   cd <project-root-dir>
   vagrant up
   vagrant ssh
   ```
2. Config `kubectl` with the K8s cluster of your choice. (The vagrant machine has the `gcloud` CLI and the `ibmcloud` CLI installed.)
3. (Optional) Build and update the training container:

   3.1 Build the Docker image:

   ```shell
   cd /vagrant/dist_training_keras_k8s
   docker build -t dist_training:latest .
   ```

   3.2 (Optional) Test the container:

   ```shell
   # create a network
   docker network create \
     --subnet=172.18.0.0/16 \
     --gateway=172.18.0.1 \
     test-net

   # run the worker0 container
   docker run --rm \
     --name worker0 \
     -p 12345:12345 \
     --network=test-net \
     dist_training:latest

   # run the worker1 container
   docker run --rm \
     --name worker1 \
     -p 23456:23456 \
     --network=test-net \
     -e "TF_CONFIG={\"cluster\":{\"worker\":[\"172.18.0.1:12345\",\"172.18.0.1:23456\"]},\"task\":{\"type\":\"worker\",\"index\":1}}" \
     dist_training:latest
   ```

   3.3 Push the image to a registry and replace the image location in the K8s yaml files:

   ```shell
   # using Docker Hub as an example:
   # first, create a repository on Docker Hub. Then run the following:
   docker logout
   docker login
   docker tag dist_training:latest <user_name>/<repo_name>:latest
   docker push <user_name>/<repo_name>:latest
   # finally, replace spec.template.spec.containers.image in both yaml files.
   ```
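The `TF_CONFIG` value passed to worker1 in step 3.2 is the standard environment variable that TensorFlow's `tf.distribute.MultiWorkerMirroredStrategy` reads to learn the cluster layout. A minimal sketch of how the per-worker values relate (the `tf_config_for` helper is invented here for illustration; every worker shares the same cluster spec, and only `task.index` differs):

```python
import json

# Cluster spec from the docker test above: both worker addresses are listed
# identically in every container; only task.index differs per worker.
cluster = {"worker": ["172.18.0.1:12345", "172.18.0.1:23456"]}

def tf_config_for(index):
    """Build the TF_CONFIG JSON string for the worker with the given index."""
    return json.dumps({"cluster": cluster,
                       "task": {"type": "worker", "index": index}})

# worker1's value corresponds to the -e "TF_CONFIG=..." flag above.
worker1 = json.loads(tf_config_for(1))
print(worker1["task"]["index"])  # -> 1
```

Note that the worker0 container in step 3.2 is started without an explicit `TF_CONFIG`, so the image presumably defaults to index 0.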
4. Deploy to the K8s cluster:

   ```shell
   kubectl apply -f worker0.yaml
   kubectl apply -f worker1.yaml
   # NOTE: insufficient resources can cause the K8s jobs to fail.
   # The jobs will restart automatically on failure.
   # Run the following to check the termination message.
   # See https://kubernetes.io/docs/tasks/debug-application-cluster/determine-reason-pod-failure/ for more information.
   kubectl get pods
   kubectl get pod <pod-name> --output=yaml
   ```
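The actual `worker0.yaml` and `worker1.yaml` ship with the repo and are authoritative. For orientation only, a hypothetical sketch of what such a Job manifest might contain (names, hostnames, and ports here are placeholders, not the repo's values):

```yaml
# Hypothetical sketch of worker0.yaml; see the real file in the repo.
apiVersion: batch/v1
kind: Job
metadata:
  name: dist-training-worker0
spec:
  template:
    spec:
      containers:
        - name: worker0
          image: <user_name>/<repo_name>:latest  # replace with your pushed image
          ports:
            - containerPort: 12345
          env:
            - name: TF_CONFIG
              value: '{"cluster":{"worker":["worker0:12345","worker1:23456"]},"task":{"type":"worker","index":0}}'
      restartPolicy: OnFailure  # matches the auto-restart behavior noted above
```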
5. Get the logs. You should be able to see both workers finish one epoch. NOTE: you can configure the number of epochs in the yaml files, but the number must be the same in both files.

   ```shell
   kubectl logs <pod-name>
   ```
1. Launch a K8s cluster: on IBM Cloud, create a *classic* K8s cluster with default settings. This is not included in the free tier or the provided class resources, but TFJob requires that the cluster supports RWX (ReadWriteMany) storage. See this page for more information.
2. Config the classic cluster following these instructions. The commands are also included as a shell script in this repo:

   ```shell
   cd /vagrant
   ibmcloud login
   sh setupCluster.sh <cluster_name>
   # NOTE: don't run the script more than once
   ```
3. Install `kfctl`: download `kfctl_v1.2.0-0-gbc038f9_linux.tar.gz` from https://github.com/kubeflow/kfctl/releases, extract it, and put the binary under either `/vagrant` or `/home/vagrant`.
4. Follow these instructions to install Kubeflow to the cluster. The commands are also included as a script in this repository:

   ```shell
   cd /vagrant
   sh installKF.sh
   ```
5. Expose the Kubeflow dashboard following these instructions. The commands are also included in this repository as a shell script:

   ```shell
   cd /vagrant
   sh exposeLB.sh
   # The IP to access your Kubeflow dashboard appears in the EXTERNAL_IP column.
   # Log in to the dashboard with the following credentials:
   #   username=user@example.com
   #   password=12341234
   ```
6. Launch a Jupyter notebook server.
7. Launch a terminal in Jupyter and clone the Kubeflow examples repo:

   ```shell
   git clone https://github.com/kubeflow/examples.git git_kubeflow-examples
   ```
8. Open the notebook `mnist/mnist_ibm.ipynb` and follow it to train and deploy a TFJob for MNIST on Kubeflow.

   8.1 When running `reload(notebook_setup)`, you may encounter an error with `git checkout`. The checkout has to run inside the cloned repo, so make sure the call in `mnist/notebook_setup.py` passes `cwd=clone_dir`:

   ```python
   clone_dir = os.path.join(home, "git_tf-operator")
   if not os.path.exists(clone_dir):
       logging.info("Cloning the tf-operator repo")
       subprocess.check_call(
           ["git", "clone", "https://github.com/kubeflow/tf-operator.git", clone_dir])
   logging.info(f"Checkout kubeflow/tf-operator @{TF_OPERATOR_COMMIT}")
   # cwd=clone_dir runs the checkout inside the cloned repo. (A separate
   # subprocess call to "cd" would fail: cd is a shell builtin, and it
   # could not affect later subprocesses anyway.)
   subprocess.check_call(["git", "checkout", TF_OPERATOR_COMMIT], cwd=clone_dir)
   ```
   8.2 If you run into the error `ImportError: No module named 'msrestazure'`, the quick workaround is to install it with `pip3 install msrestazure`.

   8.3 You should be able to successfully create the TFJob now. NOTE: the further steps in the notebook are not tested in our project.
This project is derived from the following works: