
Ray + SLURM + Apptainer Cluster Launcher

This repository contains everything needed to launch a multi-node Ray cluster on an HPC system using SLURM and Apptainer (formerly Singularity).

It launches, from a single SLURM batch script:

  • A Ray head node (CPU-only)
  • One or more Ray worker nodes (with GPU support)
  • A Python job script that uses Ray for distributed processing

📁 Files Included

File                          Description
ray_cluster_all_in_one.sh     SLURM batch script that launches the head, workers, and job
my_ray_script.py              Example Ray script with remote GPU tasks
ray_container.def             Apptainer definition file used to build the container

🚀 How It Works

  • Ray head is started on the first SLURM node (no GPU requested).
  • Remaining nodes launch Ray workers with GPU support.
  • A Python script runs on the head node using Ray to distribute work.
  • All processes are launched using srun and managed by SLURM.
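
In outline, the batch script does something like the following. This is a hedged sketch, not the exact contents of ray_cluster_all_in_one.sh; the variable names, Ray port, and resource flags are illustrative.

nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Start the Ray head on the first node (no GPU), kept alive in the background
srun --nodes=1 --ntasks=1 -w "$head_node" \
  apptainer exec "$CONTAINER_PATH" \
  ray start --head --node-ip-address="$head_ip" --port=6379 --block &

# Start one GPU-enabled Ray worker on each remaining node
for node in "${nodes[@]:1}"; do
  srun --nodes=1 --ntasks=1 -w "$node" \
    apptainer exec --nv "$CONTAINER_PATH" \
    ray start --address="$head_ip:6379" --block &
done

sleep 10  # crude wait for the cluster to come up

# Run the workload on the head node; when it exits, SLURM cleans up the job
srun --nodes=1 --ntasks=1 -w "$head_node" \
  apptainer exec "$CONTAINER_PATH" python my_ray_script.py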

⚙️ Prerequisites

  • A SLURM-managed HPC environment
  • Apptainer installed
  • Apptainer container image built from the provided definition file
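
To sanity-check these from a login node before building anything, the standard tools are enough (partition, QOS, and GPU flags vary by site):

apptainer --version                   # Apptainer is on the PATH
sinfo -N -l                           # nodes SLURM knows about and their state
srun --gres=gpu:1 --pty nvidia-smi    # a GPU node is actually reachable (site-specific flags may be needed)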

🛠️ Building the Container

apptainer build ray_container.sif ray_container.def

Update the path in ray_cluster_all_in_one.sh:

CONTAINER_PATH=/full/path/to/ray_container.sif
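
As a quick sanity check (assuming the definition file installs Ray, which ray_container.def is expected to do), you can exercise the freshly built image before submitting anything:

apptainer exec ray_container.sif ray --version
apptainer exec ray_container.sif python -c "import ray; print(ray.__version__)"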

📦 Submitting the Job

Submit the full cluster workload with:

sbatch ray_cluster_all_in_one.sh

It will:

  1. Start Ray head on node 1
  2. Start workers on remaining nodes
  3. Run my_ray_script.py using Ray
  4. Shut down all Ray processes
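
After submitting, the usual SLURM tooling shows what is happening; the output file name below assumes the default slurm-<jobid>.out pattern (a #SBATCH --output line in the script would change it):

squeue -u "$USER"            # is the job pending or running?
scontrol show job <jobid>    # which nodes were allocated?
tail -f slurm-<jobid>.out    # follow Ray startup logs and the script's output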

🖥️ Ray Dashboard (Optional)

To access the Ray dashboard from your workstation, forward port 8265 over SSH:

ssh -L 8265:localhost:8265 user@cluster

Then open: http://localhost:8265
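
Note that the dashboard runs on the node hosting the Ray head, which is usually a compute node rather than the login node. If the single-hop tunnel above does not work, a hedged alternative is to forward through the login node to that compute node (this assumes the head was started with --dashboard-host=0.0.0.0 so the dashboard is reachable from other hosts):

ssh -L 8265:<head-node>:8265 user@cluster   # <head-node> = hostname shown in the SLURM output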


🧪 Example Workload: my_ray_script.py

This script connects to the running Ray cluster and distributes 20 square operations across nodes:

@ray.remote(num_gpus=0.25)
def square(x):
    return x * x

You can replace this with any workload using Ray's APIs.
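
If you want to re-run just the workload by hand while the cluster is still up (for example from an interactive shell on the head node), a hedged sketch is:

apptainer exec "$CONTAINER_PATH" python my_ray_script.py   # CONTAINER_PATH as set in the batch script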


🔁 Customization

Want to change...      Do this
Number of nodes        Edit #SBATCH --nodes=X
GPUs per worker        Edit --gres=gpu:X and --num-gpus=X
CPUs per task          Edit --cpus-per-task=X and --num-cpus=X
Job duration           Adjust #SBATCH --time=...
Container content      Edit ray_container.def
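
For orientation, these knobs live in the #SBATCH header and on the ray start lines of the batch script; the values below are illustrative only and may differ from what ray_cluster_all_in_one.sh actually ships with:

#SBATCH --nodes=3            # one head node + two GPU workers
#SBATCH --cpus-per-task=8    # mirrored by --num-cpus on ray start
#SBATCH --time=01:00:00

# GPU and CPU counts are mirrored on the worker launch line, roughly:
#   srun --gres=gpu:1 ... ray start ... --num-cpus=8 --num-gpus=1 --block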

✅ Example SLURM Cluster Specs (Assumed)

  • Shared /home or /scratch across nodes
  • Passwordless SSH is not required (all processes are launched via srun)
  • GPU-enabled worker nodes (e.g., V100, A100)

📄 License

MIT or your institution's default open-source license.


🙋‍♀️ Questions?

File an issue or contact your HPC support team if you need help with SLURM or Apptainer permissions.
