This repository contains everything needed to launch a multi-node Ray cluster on an HPC system using SLURM and Apptainer (formerly Singularity).
It runs:
- A Ray head node (CPU-only)
- One or more Ray worker nodes (with GPU support)
- A Python job script using Ray for distributed processing
- All via a single SLURM batch script
File | Description |
---|---|
ray_cluster_all_in_one.sh | SLURM batch script to launch head, workers, and job |
my_ray_script.py | Example Ray script with remote GPU tasks |
ray_container.def | Apptainer definition file to build container |
- Ray head is started on the first SLURM node (no GPU requested).
- Remaining nodes launch Ray workers with GPU support.
- A Python script runs on the head node using Ray to distribute work (see the sketch after this list).
- All processes are launched using `srun` and managed by SLURM.
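The driver script referenced above does not start any Ray processes itself; it simply attaches to the cluster that the batch script has already brought up. A minimal sketch of that connection step (assuming the Ray head runs on the same node with default settings):

```python
import ray

# Attach to the Ray cluster already started by the SLURM batch script;
# "auto" discovers the head process running on this (head) node.
ray.init(address="auto")

# Show what SLURM provided: total CPUs across nodes and GPUs on the workers.
print(ray.cluster_resources())
```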
- A SLURM-managed HPC environment
- Apptainer installed
- Apptainer container image built from the provided definition file:

```bash
apptainer build ray_container.sif ray_container.def
```
Update the path in `ray_cluster_all_in_one.sh`:

```bash
CONTAINER_PATH=/full/path/to/ray_container.sif
```
Submit the full cluster workload with:

```bash
sbatch ray_cluster_all_in_one.sh
```
It will:
- Start Ray head on node 1
- Start workers on remaining nodes
- Run `my_ray_script.py` using Ray
- Shut down all Ray processes
To access the Ray dashboard:
```bash
ssh -L 8265:localhost:8265 user@cluster
```
Then open: http://localhost:8265
This script connects to the running Ray cluster and distributes 20 square operations across nodes:
```python
@ray.remote(num_gpus=0.25)
def square(x):
    return x * x
```
You can replace this with any workload using Ray's APIs.
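Fleshed out, a script of this shape looks roughly like the sketch below. This is an illustrative reconstruction, not necessarily the exact contents of `my_ray_script.py`:

```python
import ray

# Connect to the cluster started by the batch script rather than
# starting a new local Ray instance.
ray.init(address="auto")

# Each task requests a quarter of a GPU, so up to four tasks can
# share one GPU on a worker node.
@ray.remote(num_gpus=0.25)
def square(x):
    return x * x

# Distribute 20 square operations across the worker nodes and collect results.
futures = [square.remote(i) for i in range(20)]
print(ray.get(futures))

ray.shutdown()
```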
Want to change... | Do this |
---|---|
Number of nodes | Edit #SBATCH --nodes=X |
GPUs per worker | Edit --gres=gpu:X and --num-gpus=X |
CPUs per task | Edit --cpus-per-task=X and --num-cpus=X |
Job duration | Adjust #SBATCH --time=... |
Container content | Edit ray_container.def |
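If you change the GPU or CPU allocation, keep the Python-side resource requests consistent with what each worker actually advertises, or tasks will sit pending. A hedged sketch (the task name and resource values are illustrative, not from the provided script):

```python
import ray

ray.init(address="auto")

# Request per-task resources that fit inside a single worker's allocation.
# For example, with --num-gpus=1 and --num-cpus=8 per worker, Ray runs at
# most two of these tasks concurrently per worker (limited by num_gpus=0.5).
@ray.remote(num_cpus=2, num_gpus=0.5)
def heavy_task(x):
    return x * x

print(ray.get([heavy_task.remote(i) for i in range(8)]))
```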
- Shared `/home` or `/scratch` across nodes
- Passwordless SSH not needed (SLURM handles all `srun` calls)
- GPU-enabled worker nodes (e.g., V100, A100)
MIT or your institution's default open-source license.
File an issue or contact your HPC support team if you need help with SLURM or Apptainer permissions.