This repository was archived by the owner on Apr 23, 2025. It is now read-only.

Commit 18103b8

Jackson Maxfield Brown
authored
docs/ec2-debug-and-manual-cleanup (#240)
* Add docs for manual cleanup and debugging for EC2 instances
* Add IAM permissions details
* First pass resolving comments
* Remove toggle / tab and generalize resource cleanup
* Link to cloud-permission-set
* Drop userspace logs from debug
* Minor grammar fix
1 parent fc6b5fd commit 18103b8

File tree

1 file changed

+77
-0
lines changed


content/docs/self-hosted-runners.md

Lines changed: 77 additions & 0 deletions
@@ -361,6 +361,39 @@ for obtaining these keys.
361361
☝️ **Note** The same credentials can also be used for
362362
[configuring cloud storage](/doc/cml-with-dvc#cloud-storage-provider-credentials).
363363

364+
The following are the minimum IAM permissions needed for the CML runner to
365+
deploy on EC2:
366+
367+
- `ec2:CreateSecurityGroup` -- _(Firewall and SSH Access Management)_
368+
- `ec2:AuthorizeSecurityGroupEgress`
369+
- `ec2:AuthorizeSecurityGroupIngress`
370+
- `ec2:DescribeSecurityGroups`
371+
- `ec2:DescribeSubnets`
372+
- `ec2:DescribeVpcs`
373+
- `ec2:ImportKeyPair`
374+
- `ec2:DeleteKeyPair`
375+
- `ec2:CreateTags` -- _(General Resource Management)_
376+
- `ec2:RunInstances` -- _(EC2 Instance Management)_
377+
- `ec2:DescribeImages`
378+
- `ec2:DescribeInstances`
379+
- `ec2:TerminateInstances`
380+
- `ec2:DescribeSpotInstanceRequests` -- _(Optionally needed for Spot Access)_
381+
- `ec2:RequestSpotInstances`
382+
- `ec2:CancelSpotInstanceRequests`
383+
384+
Beyond this list, you will need to add any extra permissions that your
385+
workflow requires. These extra permissions can either be granted
386+
directly to the account used by `cml runner` or be specified when
387+
invoking the `cml runner` command with:
388+
[`--cloud-permission-set`](https://cml.dev/doc/ref/runner#--cloud-permission-set)
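
As an illustration, a launch command that passes a permission set might be assembled as follows. This is only a sketch: the instance profile ARN is hypothetical, the value is assumed to be an instance profile ARN on AWS, and the remaining flags (`--cloud`, `--cloud-region`, `--cloud-type`, `--labels`) are assumed to match your installed CML version.

```bash
# Hypothetical instance profile ARN; replace with one that exists in your account.
PROFILE_ARN="arn:aws:iam::123456789012:instance-profile/cml-s3-access"

# Assemble the command as a string so it can be reviewed before running.
CML_CMD="cml runner --cloud=aws --cloud-region=us-west --cloud-type=t2.micro --labels=cml-runner --cloud-permission-set=${PROFILE_ARN}"

# Dry run: print the command instead of executing it.
echo "$CML_CMD"
```

Printing the command first makes it easy to confirm the ARN before any instance is created.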
389+
390+
For example, if you need to read and write data on S3, you may want to add:
391+
392+
- `s3:ListBucket`
393+
- `s3:PutObject`
394+
- `s3:GetObject`
395+
- `s3:DeleteObject`
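
The S3 permissions above could be collected into an IAM policy document along these lines. This is a minimal sketch, not a required layout: the bucket name `my-bucket` and the file name `cml-s3-policy.json` are hypothetical. Note that `s3:ListBucket` applies to the bucket itself, while the object actions apply to the keys inside it.

```bash
# Hypothetical bucket name; replace with your own.
BUCKET="my-bucket"

# Write the policy document to a local file (the file name is an assumption).
cat > cml-s3-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::${BUCKET}"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::${BUCKET}/*"
    }
  ]
}
EOF
```

The resulting file can then be attached to the role or user of your choice with your usual IAM tooling.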
396+
364397
</tab>
365398
<tab title="Azure">
366399

@@ -391,6 +424,50 @@ provisioned through environment variables instead of files.
391424
</tab>
392425
</toggle>
393426

427+
#### Cloud Compute Resource Manual Cleanup
428+
429+
In very rare cases, you may need to clean up CML cloud resources manually.
430+
An example of such a problem can be seen
431+
[when an EC2 instance ran out of storage space](https://github.com/iterative/cml/issues/1006).
432+
433+
The following is a list of all the resources you may need to
434+
manually clean up in the case of a failure:
435+
436+
- The running instance (named with pattern `cml-{random-id}`)
437+
- The volume attached to the running instance
438+
(this should delete itself after terminating the instance)
439+
- The generated key-pair (named with pattern `cml-{random-id}`)
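
On AWS, the resources listed above could be located and removed with the AWS CLI roughly as follows. This is a sketch under assumptions: the helper function names are hypothetical, and the `cml-{random-id}` name is assumed to be stored in the instance's `Name` tag. Review the listing output before deleting anything.

```bash
# List running instances whose Name tag matches cml-*, with their key pair names.
cml_list_instances() {
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=cml-*" "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].[InstanceId,KeyName]" \
    --output text
}

# Terminate one instance and delete its key pair.
# $1: instance ID, $2: key pair name
cml_cleanup() {
  aws ec2 terminate-instances --instance-ids "$1"
  # Wait until the instance is gone before removing the key pair.
  aws ec2 wait instance-terminated --instance-ids "$1"
  aws ec2 delete-key-pair --key-name "$2"
}
```

For example, `cml_list_instances` followed by `cml_cleanup <instance-id> <key-name>` for each row it prints.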
440+
441+
If you keep encountering issues, please try to pull the logs from the
442+
running instance before terminating it, and then open a GitHub Issue.
443+
444+
For easy access and debugging, add the following option to the `cml runner` command:
445+
446+
> `--cloud-startup-script=$(echo 'echo "$(curl https://github.com/'"$GITHUB_ACTOR"'.keys)" >> /home/ubuntu/.ssh/authorized_keys' | base64 -w 0)`
447+
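
The option above base64-encodes a one-line shell script that appends the public SSH keys published at `https://github.com/<user>.keys` to the instance's `authorized_keys` file. To inspect what will run before launching, the encoding can be reproduced and decoded locally. A minimal sketch, assuming GNU `base64` (for `-w 0`) and using `octocat` as a hypothetical fallback when `GITHUB_ACTOR` is unset:

```bash
# Hypothetical fallback user name for local testing outside GitHub Actions.
GITHUB_ACTOR="${GITHUB_ACTOR:-octocat}"

# Same one-line script as in the option above, kept as plain text.
STARTUP_SCRIPT='echo "$(curl https://github.com/'"$GITHUB_ACTOR"'.keys)" >> /home/ubuntu/.ssh/authorized_keys'

# Encode it on a single line, as the --cloud-startup-script value expects.
ENCODED="$(printf '%s\n' "$STARTUP_SCRIPT" | base64 -w 0)"

# Decode again to verify that the script survived the round trip.
printf '%s\n' "$ENCODED" | base64 -d
```

The decoded output should match the script text exactly; if it does not, check the quoting around `$GITHUB_ACTOR`.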
448+
If you encounter an error with the `cml runner` instance, retrieving logs
449+
with the following commands is helpful for diagnosing the issue:
450+
451+
☝️ **Note** Please give your `cml.log` a visual scan; entries like IP addresses
452+
and Git repository names may be present and can be sensitive in some cases.
453+
454+
```bash
455+
# Connect to the instance; the commands below run inside that remote session
ssh ubuntu@instance_public_ip
456+
sudo journalctl -n all -u cml.service --no-pager > cml.log
457+
sudo dmesg --ctime > system.log
458+
```
459+
460+
You can then copy those logs to your local machine with:
461+
462+
```bash
463+
scp ubuntu@instance_public_ip:~/cml.log .
464+
scp ubuntu@instance_public_ip:~/system.log .
465+
```
466+
467+
If the SSH command hangs, the instance may be severely broken. If that
468+
happens, reboot it from the web console and try the commands
469+
again.
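
To avoid waiting indefinitely on a hanging connection, the reachability check can be given a timeout. A small sketch, where `check_ssh` is a hypothetical helper and the 10-second timeout is an arbitrary choice:

```bash
# Report whether the instance accepts SSH connections within 10 seconds.
# $1: public IP address of the instance
check_ssh() {
  # BatchMode=yes prevents interactive prompts; ConnectTimeout bounds the wait.
  if ssh -o ConnectTimeout=10 -o BatchMode=yes "ubuntu@$1" true; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}
```

If this prints `unreachable`, rebooting from the web console as described above is the next step.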
470+
394471
#### On-premise (Local) Runners
395472

396473
The `cml runner` command can also be used to manually set up a local machine,
