From f30a9d2b6971dfe78dd3e850c0038934245f61ab Mon Sep 17 00:00:00 2001 From: JacksonMaxfield Date: Mon, 16 May 2022 13:13:46 -0700 Subject: [PATCH 1/7] Add docs for manual cleanup and debugging for EC2 instances --- content/docs/self-hosted-runners.md | 45 +++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/content/docs/self-hosted-runners.md b/content/docs/self-hosted-runners.md index 1fdc3576..144da8c5 100644 --- a/content/docs/self-hosted-runners.md +++ b/content/docs/self-hosted-runners.md @@ -391,6 +391,51 @@ provisioned through environment variables instead of files. +#### Cloud Compute Resource Manual Cleanup + +In very rare cases, you may need to cleanup CML cloud resources manually. +An example of such a problem can be seen +[when an EC2 instance ran out of storage space](https://github.com/iterative/cml/issues/1006). + +The following sections contain lists of all the resources you may need to +manually cleanup in the case of a failure. + + + + +- The running EC2 instance (named with pattern `cml-{random-id}`) +- The volume attached to the running EC2 instance + (this should delete itself after terminating the EC2 instance) +- The generated key-pair (named with pattern `cml-{random-id}`) + +If you keep encountering issues, it is appreciated to attempt pulling the logs +from the running instance before terminating and opening a GitHub Issue. + +To do so add a startup command to the runner: + +> `--cloud-startup-script=$(echo 'echo "$(curl https://github.com/'"$GITHUB_ACTOR"'.keys)" >> /home/ubuntu/.ssh/authorized_keys' | base64 -w 0)` + +Once the instance fails you can attempt to connect to it and dump logs with: + +```bash +ssh ubuntu@instance_public_ip +sudo journalctl -n all -u cml.service --no-pager > cml.log +sudo dmesg --ctime > system.log +sudo dmesg --ctime --userspace > userspace.log +``` + +You can then copy those logs to your local machine with: + +```bash +scp ubuntu@instance_public_ip:~/cml.log . 
+scp ubuntu@instance_public_ip:~/system.log . +scp ubuntu@instance_public_ip:~/userspace.log . +``` + +There is a chance that the instance could be severely broken if the SSH command +hangs -- if that happens reboot it from the web console and try the commands +again. + #### On-premise (Local) Runners The `cml runner` command can also be used to manually set up a local machine, From a0d6a15ad67637d63a9f8b02b097b222c79e4b23 Mon Sep 17 00:00:00 2001 From: JacksonMaxfield Date: Mon, 16 May 2022 15:12:04 -0700 Subject: [PATCH 2/7] Add IAM permissions details --- content/docs/self-hosted-runners.md | 30 +++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/content/docs/self-hosted-runners.md b/content/docs/self-hosted-runners.md index 144da8c5..27610ddb 100644 --- a/content/docs/self-hosted-runners.md +++ b/content/docs/self-hosted-runners.md @@ -361,6 +361,36 @@ for obtaining these keys. ☝️ **Note** The same credentials can also be used for [configuring cloud storage](/doc/cml-with-dvc#cloud-storage-provider-credentials). +The following are the minimum IAM permissions needed for the CML runner to +deploy on EC2: + +- `ec2:CreateSecurityGroup` -- _(Firewall and SSH Access Management)_ +- `ec2:AuthorizeSecurityGroupEgress` +- `ec2:AuthorizeSecurityGroupIngress` +- `ec2:DescribeSecurityGroups` +- `ec2:DescribeSubnets` +- `ec2:DescribeVpcs` +- `ec2:ImportKeyPair` +- `ec2:DeleteKeyPair` +- `ec2:CreateTags` -- _(General Resource Management)_ +- `ec2:RunInstances` -- _(EC2 Instance Management)_ +- `ec2:DescribeImages` +- `ec2:DescribeInstances` +- `ec2:TerminateInstances` +- `ec2:DescribeSpotInstanceRequests` -- _(Optionally needed for Spot Access)_ +- `ec2:RequestSpotInstances` +- `ec2:CancelSpotInstanceRequests` + +Outside of this list, you will need to add any extra permissions required +for your process to complete.
+ +For example, if you need S3 read and write data, you may want to add: + +- `s3:ListBucket` +- `s3:PutObject` +- `s3:GetObject` +- `s3:DeleteObject` + From 0c99744653af827066352dd16b10abd6ea6c371f Mon Sep 17 00:00:00 2001 From: JacksonMaxfield Date: Mon, 16 May 2022 15:16:08 -0700 Subject: [PATCH 3/7] First pass resolving comments --- content/docs/self-hosted-runners.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/content/docs/self-hosted-runners.md b/content/docs/self-hosted-runners.md index 27610ddb..6ede097f 100644 --- a/content/docs/self-hosted-runners.md +++ b/content/docs/self-hosted-runners.md @@ -441,11 +441,15 @@ manually cleanup in the case of a failure. If you keep encountering issues, it is appreciated to attempt pulling the logs from the running instance before terminating and opening a GitHub Issue. -To do so add a startup command to the runner: +For easy access and debugging on the `cml runner` instance add: > `--cloud-startup-script=$(echo 'echo "$(curl https://github.com/'"$GITHUB_ACTOR"'.keys)" >> /home/ubuntu/.ssh/authorized_keys' | base64 -w 0)` -Once the instance fails you can attempt to connect to it and dump logs with: +If you encounter an error with the `cml runner` instance retrieving logs +with the following is helpful for diagnosing the issue: + +☝️ **Note** Please give your cml.log a visual scan, entries like IP addresses +and git repository names may be present and sensitive in some cases. ```bash ssh ubuntu@instance_public_ip @@ -466,6 +470,9 @@ There is a chance that the instance could be severely broken if the SSH command hangs -- if that happens reboot it from the web console and try the commands again. 
+ + + #### On-premise (Local) Runners The `cml runner` command can also be used to manually set up a local machine, From 920163545f5fe7affe6a694dc4a29c27f25a3b74 Mon Sep 17 00:00:00 2001 From: JacksonMaxfield Date: Mon, 16 May 2022 15:40:23 -0700 Subject: [PATCH 4/7] Remove toggle / tab and generalize resource cleanup --- content/docs/self-hosted-runners.md | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/content/docs/self-hosted-runners.md b/content/docs/self-hosted-runners.md index 6ede097f..80deb282 100644 --- a/content/docs/self-hosted-runners.md +++ b/content/docs/self-hosted-runners.md @@ -430,12 +430,9 @@ An example of such a problem can be seen The following sections contain lists of all the resources you may need to manually cleanup in the case of a failure. - - - -- The running EC2 instance (named with pattern `cml-{random-id}`) -- The volume attached to the running EC2 instance - (this should delete itself after terminating the EC2 instance) +- The running instance (named with pattern `cml-{random-id}`) +- The volume attached to the running instance + (this should delete itself after terminating the instance) - The generated key-pair (named with pattern `cml-{random-id}`) If you keep encountering issues, it is appreciated to attempt pulling the logs @@ -470,9 +467,6 @@ There is a chance that the instance could be severely broken if the SSH command hangs -- if that happens reboot it from the web console and try the commands again. 
- - - #### On-premise (Local) Runners The `cml runner` command can also be used to manually set up a local machine, From a7b3c1b65a5b6f2c2f9229c187682e91370bc2b9 Mon Sep 17 00:00:00 2001 From: JacksonMaxfield Date: Mon, 16 May 2022 16:03:03 -0700 Subject: [PATCH 5/7] Link to cloud-permission-set --- content/docs/self-hosted-runners.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/content/docs/self-hosted-runners.md b/content/docs/self-hosted-runners.md index 80deb282..29e84330 100644 --- a/content/docs/self-hosted-runners.md +++ b/content/docs/self-hosted-runners.md @@ -382,7 +382,10 @@ deploy on EC2: - `ec2:CancelSpotInstanceRequests` Outside of this list, you will need to add any extra permissions required -for your process to complete. +for your process to complete. These extra permissions can either be added +directly to the account used by the `cml runner` or can be specified during +the `cml runner` command with: +[`--cloud-permission-set`](https://cml.dev/doc/ref/runner#--cloud-permission-set) For example, if you need S3 read and write data, you may want to add: From bbe7e1ad2ee2d6a7d5ee8ca3b92eae8820ada215 Mon Sep 17 00:00:00 2001 From: JacksonMaxfield Date: Mon, 16 May 2022 21:43:01 -0700 Subject: [PATCH 6/7] Drop userspace logs from debug --- content/docs/self-hosted-runners.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/content/docs/self-hosted-runners.md b/content/docs/self-hosted-runners.md index 29e84330..45ff9ca2 100644 --- a/content/docs/self-hosted-runners.md +++ b/content/docs/self-hosted-runners.md @@ -455,7 +455,6 @@ and git repository names may be present and sensitive in some cases.
ssh ubuntu@instance_public_ip sudo journalctl -n all -u cml.service --no-pager > cml.log sudo dmesg --ctime > system.log -sudo dmesg --ctime --userspace > userspace.log ``` You can then copy those logs to your local machine with: @@ -463,7 +462,6 @@ You can then copy those logs to your local machine with: ```bash scp ubuntu@instance_public_ip:~/cml.log . scp ubuntu@instance_public_ip:~/system.log . -scp ubuntu@instance_public_ip:~/userspace.log . ``` There is a chance that the instance could be severely broken if the SSH command From a3f530ff539ee5b1b1798d2655b93f8d0660637e Mon Sep 17 00:00:00 2001 From: JacksonMaxfield Date: Tue, 17 May 2022 10:28:38 -0700 Subject: [PATCH 7/7] Minor grammar fix --- content/docs/self-hosted-runners.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/self-hosted-runners.md b/content/docs/self-hosted-runners.md index 45ff9ca2..6e8cebef 100644 --- a/content/docs/self-hosted-runners.md +++ b/content/docs/self-hosted-runners.md @@ -430,8 +430,8 @@ In very rare cases, you may need to cleanup CML cloud resources manually. An example of such a problem can be seen [when an EC2 instance ran out of storage space](https://github.com/iterative/cml/issues/1006). -The following sections contain lists of all the resources you may need to -manually cleanup in the case of a failure. +The following is a list of all the resources you may need to +manually cleanup in the case of a failure: - The running instance (named with pattern `cml-{random-id}`) - The volume attached to the running instance