ml-workshop

Machine learning workshop serving as an introduction to the Parallel Works ACTIVATE user experience. ACTIVATE is a single control plane for cloud and on-premise high-performance computing resources.

Please find the link to the Day 1 overview presentation here.

Please find the link to the Day 2 walkthrough presentation here.

Summary

The main activities of this workshop are to:

  1. Start a personal cloud cluster
  2. Start notebook session on cluster
  3. Download notebook from public repository to cluster
  4. Run notebook on cluster
  5. Copy files to different storage (bucket, workspace)
  6. Track cost in near real time
  7. Launch MPI job via script_submitter (optional)
  8. Stop JupyterLab session and the cluster

Help

support@parallelworks.com

Parallel Works documentation

Detailed steps

1) Login and start a personal cloud cluster

  • Log into the platform by going to hpcmp-cloud.parallel.works.
  • Change your password immediately after this session. Initial login can be complicated by:
    • delayed or filtered password-reset messages, and
    • the inability to use a PED for MFA in certain locations.
  • On the Home page, go to the Compute tile and click on the On button for your default cluster.
  • Cloud cluster startup takes ~2-5 minutes.
  • Please explore - but do not change - the configuration with the i button.
  • You may experiment with other cluster configurations after the workshop.
  • In particular, note that the cluster has the following parts:
    • a larger head node (best for running the notebook)
    • a small compute partition with two worker nodes that spin up elastically
    • a mounted disk image at /pw/apps
    • a mounted shared bucket at /pw/bbb
    • the home directory of the cluster is mounted into your ACTIVATE user workspace.

  [Cluster schematic]
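Once the cluster is up, the mounts listed above can be spot-checked from a terminal. This is a minimal sketch using the paths given above; the loop reports rather than fails, so it is safe to try anywhere:

```shell
# Report whether each workshop mount described above is present.
# On the workshop cluster, all three should print "present".
for d in /pw/apps /pw/bbb "$HOME"; do
  if [ -d "$d" ]; then
    echo "present: $d"
  else
    echo "absent:  $d"
  fi
done
```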

2) Start notebook session on cluster

  • On the ACTIVATE Home page, click on the JupyterLab workflow tile
  • There is no need to change the fields on the workflow launch page - your one running cluster autopopulates.
  • Default settings include using cached software on /pw/apps to minimize JupyterLab startup time and bypass the need to install TensorFlow.
  • You are welcome to explore the options after the workshop.
  • Click on the Execute button.
  • You can stay on the workflow launch status page, but it's more interesting to:
    • go to your Home page
    • notice your session is starting up (Sessions tile)...
    • ...and the workflow is running (Workflow Runs tile)
    • A workflow is just a series of automated steps.
    • An interactive session is a special type of workflow whose steps include the setup for sending graphics from the cloud cluster to your ACTIVATE workspace.
    • Workflows can also be purely computational (e.g. running a simulation) or even a mix of non-graphical and graphical applications.
    • Workflows are defined in an easy to use .yaml format; this is beyond the scope of the workshop but PW documentation has more information.
    • click on the run number of the JupyterLab workflow (e.g. 00001) to view the workflow progress and logs
    • JupyterLab is ready when Create session has a green checkmark in the workflow viewer or there is a green light for the entry in the Sessions tile on ACTIVATE Home.

  [JupyterLab workflow viewer]

3) Download notebook from public repository to cluster

  • Access your JupyterLab session on the head node of the cluster by clicking on its entry in the Sessions tile on ACTIVATE Home.
  • Often, it's convenient to use the Open in new tab button to place the session in its own browser tab.
  • Use the JupyterLab launcher tab to start a terminal (you may need to scroll down)
  • In the terminal in your JupyterLab session, please run git clone https://github.com/parallelworks/ml-workshop to place a copy of this repository on your cluster.
  • You should see ml-workshop in the file browser portion (left sidebar) of JupyterLab
  • You are also welcome to run simple Linux terminal commands like hostname, whoami, sinfo, and squeue to verify that you are on the head node of a SLURM cluster.

  [JupyterLab screenshot]
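After the git clone above completes, the verification commands mentioned can be combined into one quick sanity check. The SLURM-specific commands are guarded so the sketch also runs off-cluster:

```shell
# Confirm the repository is present and identify where we are running.
ls ml-workshop 2>/dev/null || echo "ml-workshop not cloned yet"
hostname   # head-node name
whoami     # your cluster username

# SLURM checks: only run if the scheduler tools are installed
if command -v sinfo >/dev/null 2>&1; then
  sinfo     # partition and node states
  squeue    # queued and running jobs
else
  echo "SLURM tools not found (not on the cluster?)"
fi
```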

4) Run notebook on cluster

  • Start the notebook by clicking on ml-workshop in the JupyterLab file browser and then cvae_example.ipynb.
  • The notebook stores code, output, and error messages all in the same file.
  • The error messages shown in the notebook's saved output are expected and can be safely ignored.
  • Notebook cells can be run individually by selecting them and clicking on the Play icon (right pointing arrowhead).
  • Or, you can go to the top menu and select Kernel > Restart Kernel and Run All to engage all the cells.
  • While running the steps of the notebook:
    • This small example of generative AI trains a neural network to recognize handwritten digits (0-9).
    • A citation, summary of the job, and an example of extending this approach to a bigger science application is presented at the top of the notebook.
    • The training and visualization steps will each take a few minutes.
    • While they run, if you have opened the JupyterLab session in its own tab, you can go back to the ACTIVATE Home page on your original browser tab to verify your session/workflow is still running.
    • If you haven't opened the JupyterLab session in its own tab, depending on your browser settings, your notebook can be interrupted. If this happens, just open the notebook again and rerun it.
    • You can monitor CPU/RAM/disk usage in near real time by selecting the i button on the line of your cluster in the Compute tile.
    • Or, you can monitor resource usage in the terminal with htop, etc.
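As an alternative to htop (which is interactive), a few one-shot commands give a snapshot you can capture in a log. This is a generic Linux sketch, not specific to the workshop cluster:

```shell
uptime                                   # load averages over 1/5/15 minutes
df -h /                                  # free disk space on the root filesystem
command -v free >/dev/null && free -h    # memory usage (procps/Linux only)
top -b -n 1 | head -n 12                 # one-shot batch snapshot of top processes
```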

5) Copy files to different storage (bucket, workspace)

  • There are several persistent storage options integrated with your ephemeral cloud cluster.
  • /pw/bbb is a shared cloud bucket mounted to the cluster.
  • For simplicity, this bucket is shared among all workshop participants; you can overwrite each other's files here!
  • The example snippet below writes your username to a file of the same name in the bucket. Since usernames are unique, this avoids overwriting other participants' files.
# Create a file named after you in the shared bucket
echo $USER > /pw/bbb/${USER}

# Confirm the file was created
ls /pw/bbb/
  • You can get short term credentials to the bucket and examples for use with standard CSP CLI tools by clicking on the Buckets tab on the left sidebar of your ACTIVATE Home, selecting the bbb bucket, and then clicking on the Credentials button in the upper right corner.
  • The home directory of your cluster is also mounted into your persistent ACTIVATE workspace. You can view the files by clicking on the Editor tab on the left sidebar. The Editor tab also opens an integrated development environment (IDE) associated with your private workspace on ACTIVATE.
  • You can upload/download files from your ACTIVATE workspace IDE to and from your local computer, as well as drag and drop files in the file browser between clusters (i.e. each cloud or on-premise cluster connected to your ACTIVATE account can mount to your IDE). This functionality supports files up to 8GB in size. For larger files, using CSP CLI tools (e.g. aws s3 ..., gcloud storage ..., or az storage ...) or other data-transfer tools is recommended.
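Because the bucket is shared, a slightly more defensive variant of the snippet above refuses to overwrite a file that already exists. This is a sketch: BUCKET falls back to a temporary directory when /pw/bbb is not mounted, so it can be tried on any machine:

```shell
me="${USER:-$(id -un)}"
BUCKET=/pw/bbb
[ -d "$BUCKET" ] || BUCKET="$(mktemp -d)"   # fall back when off-cluster

# noclobber: redirection with '>' fails instead of overwriting
set -C
if echo "$me" > "$BUCKET/$me"; then
  echo "wrote $BUCKET/$me"
else
  echo "refusing to overwrite existing $BUCKET/$me"
fi
set +C
```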

6) Track cost in near real time

  • Go back to the ACTIVATE Home page.
  • Click on the $ Cost menu item on the left sidebar.
  • You may need to set the group to ml-workshop in the ribbon/filter bar across the top of the cost dashboard.
  • To see the spend associated with your account, click on Filter Options, select User, and select your username from the list.
  • With many users from the same group on the Cost page at the same time, it may be necessary to refresh the page when adjusting the filters (circle arrow button on browser).

7) Launch MPI job via script_submitter (optional)

  • OpenMPI is already installed at /pw/apps/ompi.
  • If you run run_mpitest.sh in this repository, it will:
    • set up the system paths to access OpenMPI,
    • compile the hello world MPI source code provided here, and
    • run the code over 4 CPUs distributed across the two worker nodes.
  • You can check the status of this multi-node job with sinfo and squeue in another terminal.
  • You can also copy and paste the contents of run_mpitest.sh into the script_submitter workflow's launch page to run the script on the cluster as if it were a formal workflow.
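The steps that run_mpitest.sh performs might look roughly like the sketch below. The OpenMPI install path comes from the description above, but the source-file name (mpitest.c) and the srun flags are assumptions; the actual script in this repository is authoritative:

```shell
# Make the preinstalled OpenMPI visible on the PATH (path from the docs above)
export PATH="/pw/apps/ompi/bin:$PATH"
export LD_LIBRARY_PATH="/pw/apps/ompi/lib:$LD_LIBRARY_PATH"

# Compile and launch only when the toolchain and source are present,
# so the sketch degrades gracefully off-cluster.
if command -v mpicc >/dev/null 2>&1 && [ -f mpitest.c ]; then
  mpicc -o mpitest mpitest.c     # build the hello-world program
  srun -N 2 -n 4 ./mpitest       # 4 ranks across the 2 worker nodes
else
  echo "OpenMPI or mpitest.c not found; run this on the workshop cluster"
fi
```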

8) Stop running workflows and clusters

  • When you are finished with the workshop, please shut down your ephemeral cloud cluster to avoid unnecessary costs.
  • On your ACTIVATE Home, click on the "do not enter" symbol next to the JupyterLab entry in the Workflow runs tile to stop the workflow. This action will also automatically cancel the associated JupyterLab session in the Sessions tile.
  • If you have started other workflows and you don't need to continue them, please cancel them in the same way as the JupyterLab workflow.
  • Technically, you don't need to cancel workflows and sessions before turning off a cluster, but it is best practice: stopping them first lets the platform clean up session resources in the background.
  • On your ACTIVATE Home, click the On/Off button next to your cluster so that the button goes from green to gray.
