
Commit b9cd04a

docs: iter 4 (#532)
1 parent 030c4e4 commit b9cd04a

File tree: 5 files changed (+47 −35 lines)


README.md

Lines changed: 17 additions & 15 deletions
@@ -8,7 +8,7 @@
 
 TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.
 
-- **Lower cost with spot recovery**: transparent auto-recovery from interrupted low-cost spot/preemptible instances
+- **Lower cost with spot recovery**: transparent data checkpoint/restore & auto-respawning of low-cost spot/preemptible instances
 - **No cloud vendor lock-in**: switch between clouds with just one line thanks to unified abstraction
 - **No waste**: auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use
 - **Developer-first experience**: one-command data sync & code execution with no external server, making the cloud feel like a laptop
@@ -39,10 +39,12 @@ There are a several reasons to use TPI instead of other related solutions (custo
 TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline.
 2. **Unified tool for data science and software development teams**:
 TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production.
+3. **Reproducible, codified environments**:
+Store hardware requirements in a single configuration file alongside the rest of your ML pipeline code.
 
 [^scalers]: [AWS Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html), [Azure VM Scale Sets](https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets), [GCP managed instance groups](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups), and [Kubernetes Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job).
 
-<img width=24px src="https://static.iterative.ai/logo/cml.svg"/> TPI is used to power [CML runners](https://cml.dev/doc/self-hosted-runners), bringing cloud providers to existing CI/CD workflows.
+<img width=24px src="https://static.iterative.ai/logo/cml.svg"/> TPI is used to power [CML](https://cml.dev), bringing cloud providers to existing GitHub, GitLab & Bitbucket CI/CD workflows ([repository](https://github.com/iterative/cml)).
 
 ## Usage
 
@@ -74,12 +76,12 @@ provider "iterative" {}
 resource "iterative_task" "example" {
   cloud     = "aws" # or any of: gcp, az, k8s
   machine   = "m" # medium. Or any of: l, xl, m+k80, xl+v100, ...
-  spot      = 0 # auto-price. Or -1 to disable, or >0 to set a hourly USD limit
+  spot      = 0 # auto-price. Default -1 to disable, or >0 for hourly USD limit
   disk_size = 30 # GB
 
   storage {
-    workdir = "."
-    output  = "results"
+    workdir = "." # default blank (don't upload)
+    output  = "results" # default blank (don't download). Relative to workdir
   }
   script = <<-END
     #!/bin/bash
@@ -126,7 +128,7 @@ TF_LOG_PROVIDER=INFO terraform refresh
 TF_LOG_PROVIDER=INFO terraform show
 ```
 
-### Stop Task
+### End Task
 
 ```
 TF_LOG_PROVIDER=INFO terraform destroy
@@ -149,16 +151,16 @@ direction LR
 B[("Cloud Storage (low cost)")]
 C{{"Cloud instance scaler (zero cost)"}}
 D[["Cloud (spot) Instance"]]
-A ---> |create cloud storage| B
-A --> |create cloud instance scaler| C
-A ==> |upload script & workdir| B
-A -.-> |"offline (lunch break)"| A
-C -.-> |"(re)provision instance"| D
-D ==> |run script| D
-B <-.-> |persistent workdir cache| D
-D ==> |script end,\nshutdown instance| B
+A ---> |2. create cloud storage| B
+A --> |1. create cloud instance scaler| C
+A ==> |3. upload script & workdir| B
+A -.-> |"4. offline (lunch break)"| A
+C -.-> |"5. (re)provision instance"| D
+D ==> |7. run script| D
+B <-.-> |6. persistent workdir cache| D
+D ==> |8. script end,\nshutdown instance| B
 D -.-> |outage| C
-B ==> |download output| A
+B ==> |9. download output| A
 end
 style you fill:#FFFFFF00,stroke:#13ADC7
 style tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px

docs/guides/generic-machine-types.md

Lines changed: 11 additions & 5 deletions
@@ -7,7 +7,7 @@ subcategory: Development
 
 The table below is a more detailed version of the common choices summarised in [Task Machine Types](https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-type).
 
-| Type | [`aws`] | [`az`] | [`gcp`] | [`k8s`] |
+| Type | [aws] | [az] | [gcp] | [k8s] |
 | :-------- | :------------ | :--------------------- | :---------------------------------------------- | :--------------------------------------------------- |
 | `s` | `t2.micro` | `Standard_B1s` | `g1-small` | `cpu: 1`<br>`memory: 1G` |
 | `m` | `m5.2xlarge` | `Standard_F8s_v2` | `e2-custom-8-32768` | `cpu: 8`<br>`memory: 32G` |
@@ -21,7 +21,13 @@ The table below is a more detailed version of the common choices summarised in [
 | `l+v100` | `p3.8xlarge` | `Standard_NC12s_v3` | `custom-32-262144-ext`<br>4 `nvidia-tesla-v100` | `cpu: 32`<br>`memory: 256G`<br>4 `nvidia-tesla-v100` |
 | `xl+v100` | `p3.16xlarge` | `Standard_NC24s_v3` | `custom-64-524288-ext`<br>8 `nvidia-tesla-v100` | `cpu: 64`<br>`memory: 512G`<br>8 `nvidia-tesla-v100` |
 
-[`aws`]: https://aws.amazon.com/ec2/instance-explorer
-[`az`]: https://azure.microsoft.com/en-us/pricing/vm-selector
-[`gcp`]: https://cloud.google.com/compute/docs/machine-types
-[`k8s`]: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers
+[aws]: https://aws.amazon.com/ec2/instance-explorer
+[az]: https://azure.microsoft.com/en-us/pricing/vm-selector
+[gcp]: https://cloud.google.com/compute/docs/machine-types
+[k8s]: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers
+
+## Pricing
+
+- aws: [on-demand](https://aws.amazon.com/ec2/pricing), [spot](https://aws.amazon.com/ec2/spot/pricing)
+- [az](https://azure.microsoft.com/en-us/pricing/calculator)
+- [gcp](https://cloud.google.com/products/calculator)
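
For context, a generic type from the table above is consumed via the task's `machine` argument; the provider resolves it to the cloud-specific instance per the table. A minimal sketch (resource name and script are illustrative, not part of this commit):

```hcl
resource "iterative_task" "example" {
  cloud   = "aws" # here the generic type below resolves per the aws column
  machine = "m"   # 8 CPUs & 32 GB RAM; m5.2xlarge on aws, Standard_F8s_v2 on az
  script  = <<-END
    #!/bin/bash
    echo "same generic type, any cloud"
  END
}
```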

docs/guides/getting-started.md

Lines changed: 5 additions & 4 deletions
@@ -37,12 +37,12 @@ provider "iterative" {}
 resource "iterative_task" "example" {
   cloud     = "aws" # or any of: gcp, az, k8s
   machine   = "m" # medium. Or any of: l, xl, m+k80, xl+v100, ...
-  spot      = 0 # auto-price. Or -1 to disable, or >0 to set a hourly USD limit
+  spot      = 0 # auto-price. Default -1 to disable, or >0 for hourly USD limit
   disk_size = 30 # GB
 
   storage {
-    workdir = "."
-    output  = "results"
+    workdir = "." # default blank (don't upload)
+    output  = "results" # default blank (don't download). Relative to workdir
   }
   script = <<-END
     #!/bin/bash
@@ -96,6 +96,7 @@ This command will:
 1. Create all the required cloud resources (provisioning a `machine` with `disk_size` storage).
 2. Upload the working directory (`workdir`) to the cloud.
 3. Launch the task `script`.
+4. Terminate the `machine` on `script` completion/error.
 
 With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent (`disk_size`) storage will be used to relaunch interrupted tasks.
 
@@ -117,7 +118,7 @@ These commands will:
 1. Query the task status from the cloud.
 2. Display the task status.
 
-## Stop Task
+## End Task
 
 ```console
 $ TF_LOG_PROVIDER=INFO terraform destroy

docs/index.md

Lines changed: 4 additions & 2 deletions
@@ -7,7 +7,7 @@
 
 TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.
 
-- **Lower cost with spot recovery**: transparent auto-recovery from interrupted low-cost spot/preemptible instances
+- **Lower cost with spot recovery**: transparent data checkpoint/restore & auto-respawning of low-cost spot/preemptible instances
 - **No cloud vendor lock-in**: switch between clouds with just one line thanks to unified abstraction
 - **No waste**: auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use
 - **Developer-first experience**: one-command data sync & code execution with no external server, making the cloud feel like a laptop
@@ -37,8 +37,10 @@ There are a several reasons to use TPI instead of other related solutions (custo
 TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups ([AWS Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html), [Azure VM Scale Sets](https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets), [GCP managed instance groups](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups), and [Kubernetes Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job)), taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline.
 2. **Unified tool for data science and software development teams**:
 TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production.
+3. **Reproducible, codified environments**:
+Store hardware requirements in a single configuration file alongside the rest of your ML pipeline code.
 
-<img width=24px src="https://static.iterative.ai/logo/cml.svg"/> TPI is used to power [CML runners](https://cml.dev/doc/self-hosted-runners), bringing cloud providers to existing CI/CD workflows.
+<img width=24px src="https://static.iterative.ai/logo/cml.svg"/> TPI is used to power [CML](https://cml.dev), bringing cloud providers to existing GitHub, GitLab & Bitbucket CI/CD workflows ([repository](https://github.com/iterative/cml)).
 
 ## Links

docs/resources/task.md

Lines changed: 10 additions & 9 deletions
@@ -11,19 +11,19 @@ This resource will:
 
 ```hcl
 resource "iterative_task" "example" {
-  cloud       = "aws"
+  cloud       = "aws" # or any of: gcp, az, k8s
   machine     = "m" # medium. Or any of: l, xl, m+k80, xl+v100, ...
-  image       = "ubuntu"
-  region      = "us-east"
+  image       = "ubuntu" # or "nvidia", ...
+  region      = "us-west" # or "us-east", "eu-west", ...
   disk_size   = 30 # GB
-  spot        = 0 # auto-price. Or -1 to disable, or >0 to set a hourly USD limit
+  spot        = 0 # auto-price. Default -1 to disable, or >0 for hourly USD limit
   parallelism = 1
-  timeout     = 60*60 # max 1h before forced termination
+  timeout     = 24*60*60 # max 24h before forced termination
 
   environment = { GREETING = "Hello, world!" }
   storage {
-    workdir = "."
-    output  = "results"
+    workdir = "." # default blank (don't upload)
+    output  = "results" # default blank (don't download). Relative to workdir
   }
   script = <<-END
     #!/bin/bash
@@ -105,7 +105,7 @@ The above would allow:
 $ terraform output --raw logs
 ```
 
-Finally, JSON output can be parsed using `terraform output --json` and `jq` like this:
+Finally, JSON output can be parsed using `terraform show --json` and `jq` like this:
 
 ```console
 $ terraform show --json | jq --raw-output '
@@ -169,6 +169,7 @@ In addition to generic types, it's possible to specify any machine type supporte
 The Iterative Provider offers some common machine images which are roughly the same for all supported clouds.
 
 - `ubuntu` - Official [Ubuntu LTS](https://wiki.ubuntu.com/LTS) image (currently 20.04).
+- `nvidia` - Official [NVIDIA NGC](https://docs.nvidia.com/ngc/ngc-deploy-public-cloud)-based images, typically needing `disk_size = 32` GB or more.
 
 ### Cloud-specific
 
@@ -231,8 +232,8 @@ See https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findima
 
 The Iterative Provider offers some common cloud regions which are roughly the same for all supported clouds.
 
-- `us-east` - United States of America, East.
 - `us-west` - United States of America, West.
+- `us-east` - United States of America, East.
 - `eu-north` - Europe, North.
 - `eu-west` - Europe, West.
 
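For context, the `nvidia` image added above pairs with the `disk_size` argument. A sketch with illustrative values (resource name and script are not part of this commit):

```hcl
resource "iterative_task" "example" {
  cloud     = "aws"
  machine   = "m+k80"  # a GPU machine type
  image     = "nvidia" # NGC-based image, per the note above
  disk_size = 32       # GB; NGC images typically need 32 GB or more
  script    = <<-END
    #!/bin/bash
    nvidia-smi
  END
}
```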