Commit 735c80e

Add instructions for running examples on Ray cluster on AWS (#769)
1 parent 3352974 commit 735c80e

File tree

- docs/examples/how-to-run.md
- docs/user-guide/executors.md
- examples/ray/aws/README.md
- examples/ray/aws/config.yaml
- examples/ray/aws/cubed.yaml

5 files changed: +266 -1 lines changed

docs/examples/how-to-run.md

Lines changed: 1 addition & 0 deletions
@@ -26,3 +26,4 @@ cd lithops/aws # or whichever executor/cloud combination you are using
| | Google | [modal/gcp/README.md](https://github.com/cubed-dev/cubed/blob/main/examples/modal/gcp/README.md) |
| Coiled | AWS | [coiled/aws/README.md](https://github.com/cubed-dev/cubed/blob/main/examples/coiled/aws/README.md) |
| Beam | Google | [dataflow/README.md](https://github.com/cubed-dev/cubed/blob/main/examples/dataflow/README.md) |
+| Ray | AWS | [ray/aws/README.md](https://github.com/cubed-dev/cubed/blob/main/examples/ray/aws/README.md) |

docs/user-guide/executors.md

Lines changed: 3 additions & 1 deletion
@@ -26,7 +26,9 @@ If your data is in Amazon S3 then use Lithops with AWS Lambda, and if it's in GC
[**Google Cloud Dataflow**](https://cloud.google.com/dataflow) is relatively straightforward to get started with. It has the highest overhead for worker startup (minutes compared to seconds for Modal or Lithops), and although it has only been tested with ~20 workers, it is a mature service and therefore should be reliable for much larger computations.

-We have **experimental** executors for [**Ray**](https://www.ray.io/) (see [#488](https://github.com/cubed-dev/cubed/issues/488)), [**Globus Compute**](https://www.globus.org/) (see [#689](https://github.com/cubed-dev/cubed/pull/689)), and [**Apache Spark**](https://spark.apache.org/) (see [#499](https://github.com/cubed-dev/cubed/issues/499)). These have not had much testing, so we'd be very interested in feedback if you try them out.
+[**Ray**](https://www.ray.io/) is a popular way to scale Python and AI applications. You can run Cubed computations on a Ray cluster.
+
+We have **experimental** executors for [**Globus Compute**](https://www.globus.org/) (see [#689](https://github.com/cubed-dev/cubed/pull/689)) and [**Apache Spark**](https://spark.apache.org/) (see [#499](https://github.com/cubed-dev/cubed/issues/499)). These have not had much testing, so we'd be very interested in feedback if you try them out.

## Specifying an executor
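For context, this is roughly what selecting the new Ray executor looks like from Python. It is a sketch, not part of this commit: it assumes `cubed.Spec` accepts `executor_name`, mirroring the `spec` section of the `cubed.yaml` file added below, and the bucket name is a placeholder.

```python
# Sketch only: select the Ray executor via a Spec constructed in Python.
# Assumes cubed.Spec accepts executor_name (as the cubed.yaml config suggests);
# the S3 bucket is a placeholder for your own intermediate-data bucket.
import cubed
import cubed.array_api as xp

spec = cubed.Spec(
    work_dir="s3://cubed-example-temp",  # placeholder bucket for intermediate Zarr data
    allowed_mem="2GB",
    executor_name="ray",
)

a = xp.ones((5000, 5000), chunks=(1000, 1000), spec=spec)
total = xp.sum(a)
print(total.compute())  # runs the plan as Ray tasks; prints 25000000.0
```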

examples/ray/aws/README.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# Examples running Cubed on Ray (AWS)

## Prerequisites

1. An AWS account

## Start Ray Cluster

Use Ray's Cluster Launcher to launch a Ray cluster on EC2, following https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/aws.html.

1. Install a Python environment by running the following from this directory:

```shell
conda create --name ray-aws-cluster -y python=3.12
conda activate ray-aws-cluster
pip install "ray[default]" boto3
```

2. Start the Ray cluster (after adjusting the `config.yaml` settings for region and number of workers, if needed):

```shell
ray up -y config.yaml
ray dashboard config.yaml # (optional) port forward to view the dashboard at http://localhost:8265/
```

## Set up

1. Create a new S3 bucket (called `cubed-<username>-temp`, for example) in the same region you started the Ray cluster in. This will be used for intermediate Zarr data.

2. Attach to the head node:

```shell
ray attach config.yaml
```

3. Clone the Cubed repository:

```shell
git clone https://github.com/cubed-dev/cubed
cd cubed/examples
pip install "cubed[diagnostics]" s3fs
export CUBED_CONFIG=$(pwd)/ray/aws
export USER=...
```

4. Set environment variables for AWS credentials so they are available on the workers for S3 access:

```shell
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
```

Note that another way to do this is described in the Ray documentation: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/aws.html#accessing-s3.
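Before running the examples, a quick way to check from the head node that the cluster is up and the worker has joined (a sketch, not part of the original instructions):

```python
# Sanity check (sketch): run from a Python prompt on the head node.
import ray

ray.init(address="auto")        # connect to the cluster started by the launcher
print(ray.cluster_resources())  # should report CPUs from both the head and worker nodes
```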
## Examples

Now run the examples in the [docs](https://cubed-dev.github.io/cubed/examples/index.html).
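As a minimal end-to-end check before the full examples, something like the following should work (a sketch; it relies on `CUBED_CONFIG` pointing at `ray/aws` so that the Ray executor and S3 `work_dir` are picked up from `cubed.yaml`):

```python
# Minimal smoke test (sketch, not one of the documented examples).
# Relies on CUBED_CONFIG=.../ray/aws so the spec comes from cubed.yaml.
import cubed
import cubed.array_api as xp

a = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], chunks=(2, 2))
b = xp.asarray([[1, 1, 1], [1, 1, 1], [1, 1, 1]], chunks=(2, 2))
c = xp.add(a, b)
print(c.compute())  # executed by the Ray executor configured in cubed.yaml
```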
## Shut Down Ray Cluster

Don't forget to shut down the Ray cluster:

```shell
ray down -y config.yaml
```

examples/ray/aws/config.yaml

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
# Based on https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml

# An unique identifier for the head node and workers of this cluster.
cluster_name: cubed-ray-cluster

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 1

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    # image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    image: rayproject/ray:latest-py312-cpu # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: eu-west-1
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes will be launched in the first listed availability zone and will
    # be tried in the subsequent availability zones if launching fails.
    availability_zone: eu-west-1a,eu-west-1b
    # Whether to allow node reuse. If set to False, nodes will be terminated
    # instead of stopped.
    cache_stopped_nodes: False # If not present, the default is True.

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
# ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            # Default AMI for us-west-2.
            # Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py
            # for default images for other zones.
            # ImageId: ami-0387d929287ab193e
            # You can provision additional disk space with a conf as follows
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 140
                      VolumeType: gp3
            # Additional options in the boto docs.
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 1
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            # Default AMI for us-west-2.
            # Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py
            # for default images for other zones.
            # ImageId: ami-0387d929287ab193e
            # Run workers on spot by default. Comment this out to use on-demand.
            # NOTE: If relying on spot instances, it is best to specify multiple different instance
            # types to avoid interruption when one instance type is experiencing heightened demand.
            # Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/
            #InstanceMarketOptions:
            #    MarketType: spot
                # Additional options can be found in the boto docs, e.g.
                #   SpotOptions:
                #       MaxPrice: MAX_HOURLY_PRICE
            # Additional options in the boto docs.
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 140
                      VolumeType: gp3

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

examples/ray/aws/cubed.yaml

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
spec:
  work_dir: "s3://cubed-$USER-temp"
  allowed_mem: "2GB"
  executor_name: "ray"
  executor_options:
    ray_init:
      runtime_env:
        pip: ["cubed", "s3fs"]
        env_vars:
          AWS_ACCESS_KEY_ID: "$AWS_ACCESS_KEY_ID"
          AWS_SECRET_ACCESS_KEY: "$AWS_SECRET_ACCESS_KEY"
      ignore_reinit_error: True
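The `ray_init` mapping is presumably passed straight through to `ray.init()`, which is how the workers get `cubed` and `s3fs` installed and the AWS credentials set before any tasks run. A sketch of the equivalent call (assumed behaviour, not shown in this commit):

```python
# Assumed equivalent of the executor_options above: the ray_init mapping is
# passed through to ray.init(), so every worker gets cubed and s3fs installed
# and the AWS credentials exported via the runtime environment.
import os
import ray

ray.init(
    runtime_env={
        "pip": ["cubed", "s3fs"],
        "env_vars": {
            "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
            "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
        },
    },
    ignore_reinit_error=True,  # safe to call again if Ray is already initialised
)
```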
